Abstract

The research contribution of this thesis is the first known integrated architecture
and feature selection algorithm for Radial Basis Neural Networks (RBNNs). The
objective is to apply the network iteratively to determine the final architecture and feature
set used to evaluate a problem. Additionally, this thesis compares three different
classification techniques, Discriminant Analysis (DA), Feed-Forward Neural Networks
(FFNN) and RBNNs, against several hard-to-solve problems. These problems were used to
evaluate general classifier performance as well as the performance of the feature
selection techniques.

This  thesis  describes  the  classification  techniques  as  well  as  the  measures  used  to 
evaluate them. It next develops a new clustering technique used to determine the
network architecture and the saliency measure used to select features for RBNNs. Next,
the thesis applies these techniques to three general problems: Block-C, the University of
Wisconsin Breast Cancer Data (UWBCD) and a noise-corrupted version of Fisher’s Iris
problem. Finally, the conclusions and recommendations for future research are provided.
AN INTEGRATED ARCHITECTURE AND FEATURE SELECTION
ALGORITHM FOR RADIAL BASIS NEURAL NETWORKS


1  Introduction 


1.1  General  Discussion 

The  science  of  classification  deals  with  a  general  class  of  problems  wherein  real- 
world  observations  are  used  to  distinguish  between  two  or  more  classes  of  interest.  One 
example  of  classification  is  a  college  admissions  department  attempting  to  distinguish 
individuals  who  will  graduate  from  those  who  will  not.  Another  example  is  the 
classification  of  certain  cells  as  cancerous  or  benign.  Military  applications  include 
automated  classification  of  images  as  target  or  clutter.  There  are  numerous  approaches  to 
classification,  encompassing  qualitative  and  quantitative  techniques.  The  focus  of  this 
thesis  is  on  quantitative  techniques  including  discriminant  analysis  (DA)  and  artificial 
neural  networks  (ANN). 

Regardless  of  the  approach  used,  there  will  likely  be  errors  in  determining  the 
class  in  which  an  observation  belongs.  Associated  with  misclassification  errors  are  costs 
or losses. Some costs are minimal, such as denying college admission to someone who
would graduate. This will only hurt an institution if it does not admit and graduate
enough students to make money. In other situations, however, misclassifications can have
very serious consequences. If cancerous cells are misdiagnosed as benign, lives could be
lost. The goal of all classification problems is to minimize misclassifications, particularly
those that are very costly. Therefore, it is important to understand the situations where
classifiers will perform well, as well as the situations where they struggle.

There  are  certain  problems  for  which  some  classifiers  perform  poorly.  Alsing  [1], 
in  evaluating  competing  classifiers,  presented  several  challenges  to  a  linear  or  quadratic 
discriminant  classifier.  Data  that  is  not  separable  in  a  linear  or  quadratic  fashion  defeats 
linear and quadratic classifiers. Examples of such problems include XOR data, the Block
C problem (Figure 1-1) and the Iron Cross problem (Figure 1-2). These problems depart
from  multivariate  normality  into  the  realm  of  pattern  recognition  as  it  might  be  applied  to 
image  classification  and  human  behavior. 


Figure 1-1. Block C Problem

Figure 1-2. Iron Cross

The dimensionality of the data can also pose problems for a classifier. G. V.
Trunk [17] argues that the prediction accuracy of a classifier will drop to 50% as the
number of dimensions in the data increases for a finite data set. In his application, he
adds real features to the exemplars, with the distance between the two classes for each
successive feature approaching zero. Classification is accomplished using a simplified
classifier, which assumes the distribution of the two classes and does not estimate this
information from the data. While these assumptions are not viable for the techniques that
will be discussed in this thesis, his result does suggest that a growing number of features can have a
detrimental impact on classification accuracy. This thesis will explore the relationship
between dimensionality and classification accuracy for DA and ANNs. It will also
measure the impact that feature selection, the removal of insignificant features, has on
classifier performance.

DA  and  ANNs  are  generally  used  for  classification  and  pattern  recognition 
problems  [20].  These  classifiers  attempt  to  map  the  input  vectors  to  vectors  of  ones  and 
zeros  (depending  on  the  number  of  classes  in  the  problem).  In  addition  to  classification 
problems,  ANNs  can  be  applied  to  nonlinear  regression  [20].  Radial  basis  neural 
networks  (RBNN)  can  be  employed  in  a  generalized  regression  neural  network  (GRNN) 
framework.  In  this  framework  the  networks  fit  a  nonlinear  function  to  the  input  data, 
providing  a  function  as  output  instead  of  a  classification  vector  or  value  [19].  A  special 
case  of  nonlinear  regression  is  time  series  analysis,  where  the  features  are  the  previous 
responses  (in  time)  with  some  delay  [9]. 

1.2  Problem  Statement  and  Research  Objectives 

This  thesis  will  compare  the  efficacy  of  the  aforementioned  classifiers  using 
several techniques explored in Alsing [1]. One measure used will be classification
accuracy - an estimate of the Actual Error Rate (AER) calculated from applying the
classifier developed against an independent validation data set. Receiver Operating
Characteristic (ROC) curves will also be used to compare the impact of differing decision
criteria on Type I and II errors. Lastly, a Multinomial Selection procedure will be used to
rank the classifiers over the different problems.

Hard-to-solve  problems  will  be  explored  in  relation  to  the  classifiers.  The 
problems  evaluated  will  include  general  classification  and  feature  selection  problems. 
This  thesis  will  explore  the  problems  dimensionality  poses  to  a  general  classification 
problem.  It  will  also  analyze  different  pattern  recognition  problems  of  varying 
complexity  to  challenge  the  classifiers.  Finally,  it  will  apply  the  classification  techniques 
against  breast  cancer  data  from  the  University  of  Wisconsin  [18]  and  Fisher’s  Iris 
Problem  [4]. 

The goal of this research is two-fold. The main research objective is to develop
an integrated architecture and feature selection algorithm for RBNNs. This feature
selection  algorithm  will  be  compared  with  the  feature  selection  techniques  for  the  other 
classifiers.  A  secondary  goal  included  in  this  effort  is  to  evaluate  the  overall  effect 
feature  selection  has  on  classification  accuracy  across  the  classifiers. 

Further,  different  classifiers  will  be  evaluated  against  a  set  of  challenging 
problems.  The  goal  is  to  explore  differences  in  classifier  performance  against  a  broad  set 
of  problems  and  to  develop  a  methodology  to  determine  the  appropriateness  of  different 
classification  techniques  for  these  problems.  This  will  aid  in  determining  the  best 
alternatives  for  different  problem  types. 
2 Literature Review


2.1  Overview 

This  chapter  reviews  the  literature  regarding  the  classifiers  under  discussion  and 
various  evaluation  criteria  used  for  classifiers.  The  research  is  focused  on  the  area  of 
feature  selection.  For  Discriminant  Analysis  (DA),  there  is  a  discussion  of  two 
approaches  to  feature  selection:  Stepwise  DA  and  Discriminant  Loadings  (DL).  The 
literature  review  regarding  Feed  Forward  Neural  Networks  (FFNN)  will  cover  network 
architecture, backpropagation and feature selection. For the last classifier, Radial Basis
Function Neural Networks (RBNN), there is no developed feature selection algorithm;
several proposed solutions will be explored in Chapter 3. The literature review for RBNN
will  concentrate  on  network  architectures,  kernel  functions  and  clustering  algorithms. 

2.2  Discriminant  Analysis  (DA) 

DA classifies exemplars into groups by creating a decision surface - either linear or
hyperboloid - to separate the feature space into two distinct areas (for the two-group
problem). This decision surface is based on the within-class mean vectors and the
covariance structure of the features. If the two classes are linearly or quadratically
separable, DA can perfectly differentiate between the two classes if the appropriate form
is used.

A  key  assumption  for  DA  is  that  the  independent  variables  must  possess  a 
multivariate  normal  distribution  [6].  While  the  technique  remains  robust  against  small 
departures  from  normality,  if  the  data  severely  departs  from  this  assumption, 
classification accuracy can be greatly affected. Additionally, this can impact the
statistical method of feature selection, Stepwise DA, discussed below.

The second assumption impacts the DA method used - Fisher’s Approach or
Quadratic  Discrimination.  To  use  Fisher’s  Approach,  the  within  class  covariance 
structure  must  be  equal  for  the  two  groups  being  classified.  This  assumption  can  be 
tested  using  the  following  hypothesis  test  [3].  The  null  hypothesis  states  that  the  within 
class  covariance  matrices  are  from  the  same  underlying  distribution.  Under  the  null 
hypothesis 


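$$P\{-2\rho \ln W_1 \le z\} \approx P\{\chi^2_F \le z\} \qquad (2.1)$$

where q = number of groups, p = number of variables, N = total sample size, n = N - q,
$N_g$ = number in group g, $n_g = N_g - 1$, F is the degrees of freedom for the test,
$S_g$ is the sample covariance matrix of group g, S is the pooled within-class covariance
matrix, and

$$\rho = 1 - \left(\sum_{g=1}^{q}\frac{1}{n_g} - \frac{1}{n}\right)\frac{2p^2 + 3p - 1}{6(p+1)(q-1)} \qquad (2.2)$$

$$\ln W_1 = \frac{1}{2}\sum_{g=1}^{q} n_g \ln|S_g| - \frac{1}{2}\,n \ln|S| \qquad (2.3)$$

$$F = \frac{1}{2}(q-1)\,p\,(p+1) \qquad (2.4)$$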

If the test statistic, $-2\rho \ln W_1$, is sufficiently large, we reject the null hypothesis and
conclude the within class covariance structures are unequal.
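
The test above can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual form of Box's M test (Equations 2.1 to 2.4); the function names `covariance`, `log_det` and `box_m_statistic` are hypothetical and not part of the thesis. The sketch returns the statistic and its degrees of freedom, leaving the comparison against a chi-square critical value to the reader.

```python
import math

def covariance(rows):
    """Unbiased sample covariance matrix of a list of exemplars."""
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]

def log_det(matrix):
    """ln|matrix| by Gaussian elimination; assumes a positive definite matrix."""
    m = [list(row) for row in matrix]
    n = len(m)
    total = 0.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        total += math.log(abs(m[col][col]))
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= f * m[col][c]
    return total

def box_m_statistic(groups):
    """Return (-2 * rho * ln W1, degrees of freedom F) for a list of groups."""
    q, p = len(groups), len(groups[0][0])
    dofs = [len(g) - 1 for g in groups]   # n_g = N_g - 1
    n = sum(dofs)                         # n = N - q
    covs = [covariance(g) for g in groups]
    # Pooled covariance: average of group covariances weighted by n_g.
    pooled = [[sum(d * c[a][b] for d, c in zip(dofs, covs)) / n
               for b in range(p)] for a in range(p)]
    ln_w1 = (0.5 * sum(d * log_det(c) for d, c in zip(dofs, covs))
             - 0.5 * n * log_det(pooled))
    rho = 1.0 - ((sum(1.0 / d for d in dofs) - 1.0 / n)
                 * (2 * p * p + 3 * p - 1) / (6.0 * (p + 1) * (q - 1)))
    return -2.0 * rho * ln_w1, 0.5 * (q - 1) * p * (p + 1)
```

Two groups with identical covariance matrices give a statistic near zero, while a group whose spread is scaled up yields a clearly positive statistic.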


2.2.1 Fisher’s Approach

Under the assumption of a common covariance structure, Fisher’s approach can
be applied to solve the problem. Fisher sought to maximize the following equation

$$\frac{(b^T\mu_1 - b^T\mu_2)^2}{b^T \Sigma\, b} \qquad (2.5)$$

This equation describes the squared distance between the discriminant scores of the two
class means ($b^T\mu_i$) with respect to the variance of the discriminant scores ($b^T \Sigma\, b$) [3]. The
solution b to this nonlinear program is [6]

$$b = \Sigma^{-1}(\mu_1 - \mu_2) \qquad (2.6)$$

For any practical problem, the true population parameters are unknown and therefore
need to be approximated using the sample means and covariance as unbiased estimators
of the true parameters.

To  classify  a  new  exemplar,  the  linear  combination  is  applied  to  the  new  data 
point.  In  this  thesis,  the  prior  probabilities  of  the  two  groups  are  assumed  to  be  equal,  as 
well as the “costs” of misclassification. In this problem, exemplars are classified
according to which side of the midpoint of the centroids (mean vectors) they fall on in
projected space, which is

$$M = \frac{1}{2}\, b^T(\bar{x}_1 + \bar{x}_2) \qquad (2.7)$$

The decision rule (in projected space) becomes: if $Y_{new} = b^T x_{new} > M$, classify as Group 1;
otherwise classify as Group 2. This assumes the projection of the group one centroid is
larger in the projected space than that of the second group.
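
As an illustration of the procedure just described, the following Python sketch estimates b from the sample means and the pooled sample covariance, computes the midpoint M, and applies the decision rule. It assumes equal priors and equal misclassification costs, as stated above; the helper names (`mean_vector`, `pooled_covariance`, `solve`, `fit_fisher`, `classify`) are hypothetical.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def pooled_covariance(group1, group2):
    """Pooled (unbiased) within-class covariance matrix of two groups."""
    p = len(group1[0])
    scatter = [[0.0] * p for _ in range(p)]
    for rows in (group1, group2):
        m = mean_vector(rows)
        for r in rows:
            d = [r[j] - m[j] for j in range(p)]
            for i in range(p):
                for k in range(p):
                    scatter[i][k] += d[i] * d[k]
    dof = len(group1) + len(group2) - 2
    return [[s / dof for s in row] for row in scatter]

def solve(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(a[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def fit_fisher(group1, group2):
    """Return the discriminant vector b and the projected midpoint M."""
    m1, m2 = mean_vector(group1), mean_vector(group2)
    s = pooled_covariance(group1, group2)
    b = solve(s, [u - v for u, v in zip(m1, m2)])
    midpoint = 0.5 * sum(bj * (u + v) for bj, u, v in zip(b, m1, m2))
    return b, midpoint

def classify(b, midpoint, x):
    """Group 1 if the projected score exceeds the midpoint, else Group 2."""
    score = sum(bj * xj for bj, xj in zip(b, x))
    return 1 if score > midpoint else 2
```

Because b is built from the difference of the group one and group two means, the group one centroid always projects above the midpoint in this sketch.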

2.2.2  Quadratic  Discriminant  Functions 

The  quadratic  discrimination  approach  provides  a  greater  ability  to  separate 
classes  -  particularly  if  the  classes  are  not  linearly  separable.  This  approach  is  necessary 
if  the  covariance  structure  is  different  for  the  two  classes,  and  allowing  for  these 
differences  provides  the  greater  flexibility.  This  approach  is  also  easily  extended  to  more 
than  two  classes.  Each  class  generates  its  own  quadratic  discriminant  score  [6] 

$$d_i^Q(x) = -\frac{1}{2}\ln|\Sigma_i| - \frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1}(x - \mu_i) + \ln(P_i) \qquad (2.8)$$

where $P_i$ is the prior probability of the exemplar belonging to class i. The decision rule is
very simple; an exemplar is classified according to the largest discriminant score. This
approach will produce results identical to Fisher’s equation if the within-class covariance
matrices are identical. Because of the flexibility and greater classification power provided by
this approach and the relaxation of the assumption of equal within-class covariance
structures (although multivariate normality is now assumed), quadratic discriminant
functions will be used for all applications discussed in this thesis.
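
A minimal Python sketch of the quadratic discriminant score of Equation 2.8 follows. It assumes the class means, covariance matrices and priors are already estimated and that each covariance matrix is positive definite; `logdet_and_solve`, `quadratic_score` and `classify_qda` are hypothetical names used only for illustration.

```python
import math

def logdet_and_solve(cov, rhs):
    """Return (ln|cov|, cov^{-1} rhs) via Gaussian elimination.
    Assumes cov is symmetric positive definite."""
    n = len(rhs)
    aug = [list(cov[i]) + [rhs[i]] for i in range(n)]
    logdet = 0.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        logdet += math.log(abs(aug[col][col]))
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return logdet, x

def quadratic_score(x, mean, cov, prior):
    """-0.5 ln|cov| - 0.5 (x - mean)^T cov^{-1} (x - mean) + ln(prior)."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    logdet, solved = logdet_and_solve(cov, diff)
    mahalanobis_sq = sum(d * s for d, s in zip(diff, solved))
    return -0.5 * logdet - 0.5 * mahalanobis_sq + math.log(prior)

def classify_qda(x, classes):
    """classes: list of (label, mean, cov, prior); pick the largest score."""
    return max(classes, key=lambda c: quadratic_score(x, c[1], c[2], c[3]))[0]
```

With two classes that share a mean but differ in spread, the rule assigns points near the mean to the tighter class and distant points to the wider one, a boundary no linear rule can produce.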


2.2.3  Feature  Selection 

As discussed previously, two different approaches to feature selection will be
explored: Stepwise DA and Discriminant Loadings. Both approaches will be discussed
in a backward selection paradigm - all the features will be included, and one feature will
be removed at a time according to a selection criterion.

Stepwise  DA  employs  partial  F-tests  similar  to  stepwise  regression.  Without 
multivariate  normality,  the  F  statistics  will  not  accurately  describe  the  significance  of  the 
individual features. If the data is taken from a multivariate normal distribution, the
following statistic is distributed as $F_{1,\,N-p-1}$ [8]

$$F = (N-p-1)\,\frac{\dfrac{N_1 N_2}{N(N-2)}\left(\Delta_p^2 - \Delta_{p-1}^2\right)}{1 + \dfrac{N_1 N_2}{N(N-2)}\,\Delta_{p-1}^2} \qquad (2.9)$$

where N = total sample size, p = number of variables, $N_i$ = number in group i, and $\Delta^2$ is
the squared Mahalanobis distance between the respective group means, defined to be [6]

$$\Delta^2 = (\bar{x}_1 - \bar{x}_2)^T S^{-1} (\bar{x}_1 - \bar{x}_2) \qquad (2.10)$$

with S the pooled sample covariance matrix. This test statistic compares the distance between
the means with all p features, $\Delta_p^2$, with the Mahalanobis distance with one feature
removed, $\Delta_{p-1}^2$. A feature is considered
significant if $F > F_\alpha$, the null hypothesis being that the feature is not significant. Under a
backward  selection  routine  all  features  are  included  in  the  original  model.  During  each 
iteration,  the  F  statistic  is  calculated  for  each  feature,  and  the  least  significant  feature  is 
removed (the feature with the smallest F value) [8]. This process continues until all the
insignificant  features  are  removed  or  until  only  the  most  significant  feature  remains. 

Discriminant  Loadings  provide  an  alternative  to  Stepwise  DA,  and  do  not  require 
the  assumption  of  multivariate  normality;  however,  the  technique  does  assume  equal 
within-class  covariance  structures.  Discriminant  Loadings  provide  the  correlation  of  a 
feature  with  the  discriminant  function.  Loadings  have  the  following  form  [3] 


$$DL = R\,D^{1/2}\,b\,(b^T C\, b)^{-1/2} \qquad (2.11)$$

where C is the sample covariance of X, D is the diagonal matrix of the diagonal elements of C,
and R is the sample correlation of X. It is assumed that the least significant feature has
the  smallest  loading  in  absolute  value.  Similarly,  the  most  significant  feature  has  the 
largest  loading.  As  with  Stepwise  DA,  Discriminant  Loadings  can  be  applied  in  an 
iterative  manner.  For  each  iteration,  the  loadings  are  calculated  and  the  feature 
corresponding  to  the  smallest  loading  is  removed. 

Dillon  and  Goldstein  [6]  assert  that  Discriminant  Loadings  provide  a  clearer 
indication  of  which  features  are  important.  The  loadings  reflect  common  variance  among 
the predictors, and are less subject to multicollinearity among the features. The partial
F-values used in Stepwise DA, however, can be confounded by highly correlated features.
For  these  reasons,  this  thesis  will  employ  Discriminant  Loadings  to  perform  feature 
selection. 


2.3  Feed-Forward  Neural  Networks 

FFNNs (as well as the other Artificial Neural Networks (ANNs)) employ a
completely different approach to classification than DA. ANNs are loosely based on a
biological concept. Neurodes are connected and information is passed between them.
The key to using this structure for classification is the updating of the information being
passed. In FFNNs, this process is called learning, and its goal is to produce outputs that
closely resemble the class membership [3]. Figure 2-1 illustrates a standard FFNN.

There  are  generally  three  layers  to  the  network:  Input,  Hidden  and  Output.  The  upper 
layers  receive  a  weighted  sum  of  the  outputs  of  the  previous  layer’s  nodes.  Inside  the 
node,  a  threshold  function  is  applied  to  this  sum,  restricting  the  function  values  to  the 
interval  [0,1]  or  [-1,1].  The  most  commonly  used  threshold  function  is  the  sigmoid 
function  (see  Figure  2-2).  It  restricts  the  network  output  to  the  interval  [0,1],  and  most 
importantly  is  differentiable.  This  is  critical  for  backpropagation  to  work.  It  has  the 
following  form 

$$f(x) = \frac{1}{1 + e^{-x}} \qquad (2.12)$$

With enough nodes in the hidden layer, FFNNs are universal function approximators. A
FFNN is an ANN where all the connections move from lower to higher levels.

Figure 2-1. FFNN with Bias and Single Output [3]

Figure 2-2. Sigmoid Function


2.3.1  Backpropagation 

Backpropagation  is  the  standard  manner  by  which  the  weights  are  updated  in  a 
FFNN  [11].  Typically,  the  goal  of  the  network  is  to  produce  outputs  that  are  very  close 
to  one  for  class  one  and  zero  for  class  two.  The  weights  are  adjusted  during  training  to 
minimize  the  total  squared  error 

$$E = \sum_{i=1}^{n} (t^i - y^i)^2 \qquad (2.13)$$

where n is the number of exemplars, $t^i$ is the target and $y^i$ is the network output for the
i-th exemplar. The weights are initialized randomly, and then a gradient descent routine is
used  to  iteratively  update  the  weights.  The  weights  are  updated  until  the  error  converges, 
or  until  we  have  cycled  through  the  data  (an  epoch)  the  maximum  number  of  times.  For 
each  exemplar,  the  error  is  calculated.  The  weights  are  updated  according  to  the  gradient 
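of the error with respect to the weights. First the upper weights, $u_k$ (see Figure 2-1), are
updated, and then are used to update the lower weights, $w_{j,k}$. The weight updates for the
upper weights for the i-th exemplar have the following form

$$u_k^{new} = u_k^{old} + \eta\,(t^i - y^i)\,y^i(1 - y^i)\,z_k^i \qquad (2.14)$$

where $z_k^i$ is the output of the k-th hidden node for exemplar i and $\eta$ is the learning rate
(preferably around 0.01). The lower weights are updated in the following fashion

$$w_{j,k}^{new} = w_{j,k}^{old} + \eta\,(t^i - y^i)\,y^i(1 - y^i)\,u_k\,z_k^i(1 - z_k^i)\,x_j^i \qquad (2.15)$$

where $x_j^i$ is the j-th feature of the i-th exemplar. Both updates follow from differentiating
the squared error of Equation 2.13 through the sigmoid of Equation 2.12, with the constant
factor of 2 absorbed into $\eta$.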
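
The per-exemplar update can be sketched in Python as follows. This is an illustrative sketch, not the thesis implementation: it assumes a single hidden layer, sigmoid activations on both the hidden and output nodes, a bias input and a bias hidden unit, and it computes the lower-layer correction from the upper weights as they stood before the step, which is the usual gradient convention. The names `sigmoid`, `forward` and `backprop_step` are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w, u):
    """x ends with a constant 1.0 (bias input); w[j][k] links input j to hidden
    node k; u has one weight per hidden node plus a final bias weight."""
    hidden = len(u) - 1
    z = [sigmoid(sum(x[j] * w[j][k] for j in range(len(x)))) for k in range(hidden)]
    z.append(1.0)  # bias unit feeding the output node
    y = sigmoid(sum(u[k] * z[k] for k in range(len(u))))
    return z, y

def backprop_step(x, t, w, u, eta=0.01):
    """One gradient step on the squared error of a single exemplar.
    Returns new copies of the lower weights w and upper weights u."""
    z, y = forward(x, w, u)
    delta_out = (t - y) * y * (1.0 - y)            # error signal at the output
    new_u = [u[k] + eta * delta_out * z[k] for k in range(len(u))]
    new_w = [row[:] for row in w]
    for k in range(len(u) - 1):                    # skip the bias hidden unit
        delta_hidden = delta_out * u[k] * z[k] * (1.0 - z[k])
        for j in range(len(x)):
            new_w[j][k] += eta * delta_hidden * x[j]
    return new_w, new_u
```

Repeating `backprop_step` over every exemplar in the training set constitutes one epoch.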

Apart  from  a  strict  gradient  search  routine,  there  are  many  techniques  that  are 
used  to  accelerate  convergence  [12].  These  techniques  include  the  Conjugate  Gradient 
Method,  which  uses  a  second-order  approximation  of  the  gradient  along  which  to  move. 
Momentum  modifies  the  gradient  by  adding  a  first-order  term  containing  the  previous 
weight update, and is used to smooth the direction of descent. Adaptive learning adjusts
the learning rate near a minimum by shrinking the step size. This thesis will employ
MATLAB®’s “traingdx” routine, with a momentum coefficient of 0.9, and learning-rate
multipliers of 1.05 and 0.7 for increasing and decreasing the learning rate, respectively.


2.3.2  Feature  Selection 

There are two main forms of feature selection for FFNNs: derivative-based and
weight-based saliency [3]. Derivative-based saliency techniques measure the change in
network output per unit change in each of the features. For FFNNs, this is generally
approximated and not calculated in closed form. Weight-based saliency instead uses the
lower layer of weights to determine feature significance. The saliency measure for
feature i is

$$\tau_i = \sum_{j=1}^{J} (w_{i,j})^2 \qquad (2.16)$$

where $w_{i,j}$ is the lower-layer weight connecting feature i to hidden node j and J is the
number of hidden nodes. The smaller the saliency measure, the less
significant the feature.

While both saliency measures provide a numerical scale for feature significance,
neither measure provides a criterion for what is truly significant. Bauer et al. [4] have
proposed an objective criterion for determining significance, the Signal-to-Noise Ratio
(SNR) Saliency Measure. In this technique, a noise feature, drawn from a Uniform(0,1)
population for both classes, is added to the data prior to training. After training is
accomplished, the weights for this feature should remain close to zero. The other
features’ weight-based saliency measures are then compared to the noise variable’s
saliency, and the SNR for feature i becomes

$$SNR_i = 10 \log_{10}\left(\frac{\tau_i}{\tau_N}\right) \qquad (2.17)$$

where $\tau_N$ is the saliency for the noise variable. Those features with an SNR less than zero
are determined to be insignificant, and can be removed from the data set. Some care
must  be  taken  in  removing  features,  since  the  initial  weights  can  greatly  impact  this 
measure.  Training  several  networks  with  different  random  weights  can  provide  more 
confidence  in  the  significance  of  different  features. 
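
The SNR screening can be sketched as follows. This is a hedged illustration that assumes the weight-based saliency is the sum of squared lower-layer weights leaving a feature and that the SNR takes the decibel form 10 log10 of the saliency ratio; the names `weight_saliency` and `snr_saliency` are hypothetical.

```python
import math

def weight_saliency(w, i):
    """Sum of squared lower-layer weights leaving input i (w[i][j] -> hidden node j)."""
    return sum(wij ** 2 for wij in w[i])

def snr_saliency(w, noise_index):
    """SNR in decibels of each real feature relative to the injected noise feature."""
    tau_noise = weight_saliency(w, noise_index)
    return {i: 10.0 * math.log10(weight_saliency(w, i) / tau_noise)
            for i in range(len(w)) if i != noise_index}
```

For example, with trained lower weights `w = [[1.0, 2.0], [0.1, 0.1], [0.3, 0.4]]` and the noise feature in row 2, the first feature has a positive SNR and would be kept, while the second has a negative SNR and is a candidate for removal. Averaging the SNR over several networks trained from different random weights is one way to act on the caution above.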

2.4  Radial  Basis  Function  Neural  Networks  (RBNN) 

RBNNs differ from FFNNs in several very fundamental ways. Both general
network architecture and training differ between the two. RBNNs belong to the general
class of probabilistic neural networks (PNN). Under the PNN paradigm, classification is
performed by estimating a probability density function (PDF) for each class. A new
exemplar is assigned to the class whose estimated density is largest at that exemplar.
Unlike FFNNs, PNNs do not require training. A training set is read in, and is used to
generate the PDFs for each class [19].

Kernel  density  estimation  is  the  process  by  which  the  PDF’s  are  estimated.  A 
kernel  density  function  is  any  function  K  satisfying  the  following  equation  [15] 

$$\int K(x)\,dx = 1 \qquad (2.18)$$


Kernels  are  typically  symmetric,  though  not  necessarily.  The  Epanechnikov  kernel  is  the 
most  efficient  kernel  density  function;  the  kernel  minimizes  the  integrated  square  error  of 
the  estimator.  It  has  the  multivariate  form 


$$K_e(x) = \begin{cases} \dfrac{1}{2c_d}(d+2)(1 - x^T x) & x^T x < 1 \\ 0 & \text{otherwise} \end{cases} \qquad (2.19)$$

where $c_d$ is the volume of the d-dimensional unit sphere [15]. Figure 2-3 illustrates the
univariate form of the Epanechnikov kernel.

Figure 2-3. Univariate Epanechnikov Kernel

Although  the  Epanechnikov  kernel  is  the  most  efficient  method,  the  choice  of 
kernel  functions  is  relatively  insignificant.  Efficiency  of  every  other  kernel  estimator  is 
compared  as  a  ratio  to  the  Epanechnikov  kernel.  For  example,  the  Gaussian  kernel  is 
approximately  95%  efficient,  and  is  the  most  widely  used  kernel  estimator,  particularly 
for  PNN  [19].  The  Gaussian  kernel  has  the  multivariate  form 

$$K(x) = \frac{1}{\sqrt{(2\pi)^d}}\; e^{-\frac{1}{2} x^T x} \qquad (2.20)$$

The PDF is the sum of the kernels, with each weighted by 1/N, keeping the resulting
function a PDF (maintaining the property of Equation 2.18) [15].
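
A small Python sketch of kernel density estimation with the Gaussian kernel, and of the PNN decision rule built on it, is given below. The smoothing width `h` is an assumption introduced for illustration (the text above does not specify one), and the names `gaussian_kernel`, `kernel_density` and `pnn_classify` are hypothetical.

```python
import math

def gaussian_kernel(x):
    """Multivariate Gaussian kernel (2*pi)^(-d/2) * exp(-x'x / 2)."""
    d = len(x)
    return math.exp(-0.5 * sum(v * v for v in x)) / math.sqrt((2.0 * math.pi) ** d)

def kernel_density(x, sample, h=1.0, kernel=gaussian_kernel):
    """Estimate f(x) = 1/(N h^d) * sum_i K((x - x_i) / h).
    h is a smoothing width (an assumed parameter, not taken from the thesis)."""
    d = len(x)
    total = sum(kernel([(a - b) / h for a, b in zip(x, xi)]) for xi in sample)
    return total / (len(sample) * h ** d)

def pnn_classify(x, samples_by_class, h=1.0):
    """Assign x to the class whose estimated density at x is largest."""
    return max(samples_by_class,
               key=lambda c: kernel_density(x, samples_by_class[c], h))
```

Because every kernel integrates to one and each is weighted by 1/N, the estimate itself integrates to one, which can be checked numerically in one dimension.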

Under the PNN paradigm, each basis function output is weighted equally.
RBNNs allow the weighting for each output to be different. For RBNN, the hidden layer
is made up of kernel functions centered at each exemplar of the training set (in its
simplest form). Each exemplar in whole is passed to each neurode, where the kernel
function maps the d-dimensional input vector into the real numbers. This leads to the
general network architecture seen in Figure 2-4.


Figure 2-4. RBNN with Single Output


In  this  thesis,  the  standard  function  in  the  hidden  layer  will  be  the  Gaussian  with 
the  form: 

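$$h_i(x) = \exp\left(-\frac{(x - \mu_i)^T (x - \mu_i)}{2\sigma_i^2}\right) \qquad (2.21)$$

where $\mu_i$ is the center of the i-th basis function and $\sigma_i$ is its width (receptive field).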


Training is accomplished in a manner similar to the way backpropagation is used for FFNN [19].
As seen in Section 2.3.1, gradient search is used to find the minimum error. For RBNN,
the training algorithm is much simpler, with only one layer of weights to train. A single
output network will use the following equation to update the weights

$$w_i(n+1) = w_i(n) + \eta\,(t - y)\,z_i \qquad (2.22)$$

where $z_i = h_i(x)$, t is the target value, and $w_i$ and y are as described in Figure 2-4. A
single  exemplar  (x)  is  passed  through  all  the  hidden  neurodes  to  obtain  the  output  of  the 
network,  y.  Each  hidden  weight  is  then  updated  using  Equation  2.22.  When  all  the 
training  exemplars  are  processed,  one  epoch  is  complete.  This  process  will  continue  until 
the  error  is  small  enough. 
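
The training loop described above can be sketched in Python as follows. It is a minimal illustration that assumes Gaussian hidden nodes, a linear output node, and a stopping rule based on the summed squared error of one epoch; `rbf_outputs` and `train_rbnn`, along with the default values of `eta`, `max_epochs` and `tol`, are hypothetical choices.

```python
import math

def rbf_outputs(x, centers, sigmas):
    """Gaussian hidden-layer outputs z_i = exp(-||x - c_i||^2 / (2 sigma_i^2))."""
    return [math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)) / (2.0 * s * s))
            for c, s in zip(centers, sigmas)]

def train_rbnn(data, targets, centers, sigmas, eta=0.1, max_epochs=500, tol=1e-3):
    """Train the single layer of output weights with the update of Equation 2.22."""
    w = [0.0] * len(centers)
    for _ in range(max_epochs):
        sq_error = 0.0
        for x, t in zip(data, targets):
            z = rbf_outputs(x, centers, sigmas)
            y = sum(wi * zi for wi, zi in zip(w, z))       # network output
            w = [wi + eta * (t - y) * zi for wi, zi in zip(w, z)]
            sq_error += (t - y) ** 2
        if sq_error < tol:                                  # error is small enough
            break
    return w
```

On the four XOR points, with one center per exemplar and an illustrative width of 0.5 for every node, the loop drives the outputs to the targets within a modest number of epochs.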

The  training  for  RBNN  is  guaranteed  to  converge  to  a  global  minimum  if  the 
classes are separable by hyperplanes, unlike FFNN where the training might get caught in
a  local  minimum  [11,  16].  Training  for  a  RBNN  is  also  considerably  faster  than  for  a 
FFNN.  For  networks  of  similar  size,  the  difference  in  training  time  can  be  as  large  as 
three  orders  of  magnitude  [19]. 

Selecting the receptive fields ($\sigma_j$) for each center is also necessary. If chosen too
large, the center will have too great an impact on the output of exemplars far from the
center. If chosen too small, the network will only activate for those exemplars located at
the centers, leaving gaps in the classifier. One method which has produced favorable
results consists of setting $\sigma_j$ equal to the distance between the j-th center and its nearest
neighbor [19]. The nearest neighbor approach will be used in this thesis to estimate the
receptive fields used for the radial basis functions.
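
The nearest neighbor rule for the receptive fields is short enough to state directly in Python. The sketch assumes Euclidean distance and at least two distinct centers; the function name `nearest_neighbor_widths` is hypothetical.

```python
import math

def nearest_neighbor_widths(centers):
    """For each center, return the Euclidean distance to its nearest other center."""
    widths = []
    for j, c in enumerate(centers):
        widths.append(min(math.dist(c, other)
                          for k, other in enumerate(centers) if k != j))
    return widths
```

An isolated center receives a wide receptive field, while centers in a dense region receive narrow ones.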

2.4.1 Cluster Algorithms

Even  though  training  is  much  quicker  for  RBNN  than  FFNN,  subsequent 
application  of  the  network  to  new  exemplars  can  take  much  longer.  The  size  of  the 
network  in  terms  of  the  number  of  hidden  nodes  can  be  much  larger  for  a  RBNN  than  for 
an equivalent FFNN [11,16]. Clustering techniques can be used to represent multiple
hidden nodes with a single node, thus reducing the computational effort required for
training, which is proportional to the number of training vectors [12].

One  must  be  careful  not  to  use  clustering  techniques  indiscriminately.  As  the 
number  of  features  increases,  clustering  techniques  can  erroneously  identify  cluster 
centers,  clustering  around  features  which  are  not  useful  for  classification  [19].  This 
indicates  that  feature  selection  can  improve  clustering  accuracy,  which  will  in  turn 
improve classification accuracy. Three clustering algorithms will be discussed next: a
simplified algorithm due to Wasserman, K-Means and the Radial Basis Function Iterative
Construction Algorithm (RICA). Supplemental flowcharts will be included for additional
clarification.

Wasserman  [19]  presents  a  simple  clustering  algorithm,  in  which  nodes  are 
pruned  (removed  from  consideration  as  centers)  and  have  no  impact  on  the  centers  used 
when  the  network  is  trained.  Each  class  is  processed,  with  the  centers  produced  in  a 
single pass through the data. The first exemplar is chosen as a basis function center.
Each subsequent exemplar is processed using Euclidean distance to determine the closest
center.  If  this  distance  is  smaller  than  a  threshold  distance,  the  exemplar  is  discarded.  If, 
however,  the  distance  is  larger  than  the  threshold,  the  exemplar  becomes  a  new  center. 
One  problem  with  this  algorithm  is  that  different  sequences  will  produce  very  different 
results.  It  also  discards  information  about  the  density  of  the  training  data,  since  nodes  are 
pruned,  instead  of  impacting  the  location  of  the  centers. 
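
The single-pass procedure can be sketched as follows for one class of exemplars. The treatment of a distance exactly equal to the threshold (the exemplar is discarded) is an assumption, and `threshold_centers` is a hypothetical name.

```python
import math

def threshold_centers(exemplars, threshold):
    """Single pass: keep an exemplar as a new center only if it lies farther than
    `threshold` from every center chosen so far; otherwise discard it."""
    centers = [exemplars[0]]
    for x in exemplars[1:]:
        nearest = min(math.dist(x, c) for c in centers)
        if nearest > threshold:
            centers.append(x)
    return centers
```

Running the function on the same exemplars in reverse order generally returns different centers, which illustrates the sequence dependence noted above.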

K-Means clustering is a self-organizing procedure. Unlike the simple clustering
discussed above, it is iterative, stopping when the centers selected remain the same. It
derives its name from the output of the algorithm. A number of clusters (K) is specified,
and the algorithm returns the means of each cluster of data [2]. Each class will be
clustered separately, with K not necessarily the same for each class. There are several
ways to initiate the algorithm, but the most common is to assign K random exemplars as
initial centers [6]. Each exemplar is then assigned to the nearest center. Once all
the data is assigned to a cluster, the means of each cluster become the new centers. The
data is processed in this manner until the centers remain the same between
iterations [12].

Without a priori knowledge of the number of clusters, the selection of K involves
experimentation. One measure for accomplishing this task is the sum of the squared
deviations of each exemplar from its cluster center. Candidate values for K are tried, and
the value of K which produces the smallest error is selected [12]. Certain values for K
should  be  excluded.  If  K  is  allowed  to  be  equal  to  the  number  of  exemplars,  the  error 
will  be  zero,  and  the  algorithm  will  produce  clusters  equivalent  to  the  training  data. 
Hence,  if  K  is  allowed  to  approach  the  number  of  exemplars,  too  many  clusters  will  just 
contain  one  point.  For  this  research,  K  is  limited  to  one  half  the  number  of  exemplars  for 
a  given  problem.  Figure  2-5  below  illustrates  the  algorithm  in  flow-chart  form. 
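The K-Means iteration and the squared-error measure used to compare candidate values of K can be sketched as follows. This is a minimal illustration rather than the code used in this research. The function names and data are hypothetical, the initial centers are passed in explicitly (the first K exemplars here, instead of a random draw) so that the run is repeatable, and an empty cluster is assumed to keep its previous center.

```python
import math

def k_means(exemplars, initial_centers, max_iter=100):
    """Return (centers, assignments) once the centers stop changing."""
    centers = [tuple(c) for c in initial_centers]
    assignments = []
    for _ in range(max_iter):
        # Assign every exemplar to its nearest center (Euclidean distance).
        assignments = [
            min(range(len(centers)), key=lambda j: math.dist(x, centers[j]))
            for x in exemplars
        ]
        # The mean of each cluster becomes the new center.
        new_centers = []
        for j, old in enumerate(centers):
            members = [x for x, a in zip(exemplars, assignments) if a == j]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(old)  # assumption: empty cluster keeps its center
        if new_centers == centers:       # centers unchanged between iterations
            break
        centers = new_centers
    return centers, assignments

def sum_squared_error(exemplars, centers, assignments):
    """Sum of squared deviations of each exemplar from its cluster center."""
    return sum(math.dist(x, centers[a]) ** 2 for x, a in zip(exemplars, assignments))

# Hypothetical exemplars from one class, forming two obvious groups.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
centers, labels = k_means(data, initial_centers=data[:2])
error = sum_squared_error(data, centers, labels)
print(centers, labels, error)
```

Repeating the call for each candidate K and comparing the resulting errors gives the selection procedure described above.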


Figure 2-5. K-Means Algorithm

While the preceding algorithms simply define the cluster means, RICA describes each center individually by the mean and covariance of its cluster. The end result of the procedure is [21]

$$h_j(\mathbf{x}) = \exp\left[-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_j)^T C_j^{-1} (\mathbf{x} - \boldsymbol{\mu}_j)\right] \tag{2.23}$$

where $\boldsymbol{\mu}_j$ and $C_j$ are the mean and covariance of cluster j.


The key to the algorithm is determining the number of clusters and their partitioning. Wilson [21] proposes using Shapiro-Wilk test statistics to determine whether the current partition is sufficiently close to a multivariate normal distribution. A Shapiro-Wilk test statistic for the current partition of the data is compared to the test statistics of two partitions generated from the current one. Wilson employs the univariate form of the test statistic

$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{(n-1)s^2} \tag{2.24}$$

where $a_i$ are weighting coefficients developed by Shapiro and Wilk, available in tables [5] for n < 50, $x_{(i)}$ are the ordered data and $s^2$ is the sample variance. For n > 50, Shapiro and Wilk provide the following approximations for the coefficients [14]


$$a_i = \frac{2 m_i}{C} \quad \text{for } i \neq 1, n \tag{2.25}$$

where

$$m_i = \Phi^{-1}\left(\frac{i - 0.375}{n + 0.25}\right), \quad i = 1, \ldots, n \quad [10] \tag{2.26}$$

with $\Phi^{-1}$ being the inverse cumulative distribution function of the standard normal distribution and

$$C = \sqrt{-2.722 + 4.083n} \tag{2.27}$$

For $a_1$ and $a_n$, they propose a different approximation

(2.28)

As  the  data  tends  toward  a  normal  distribution,  the  test  statistic  tends  toward  1.0;  the  test 
statistic  will  approach  zero  for  data  that  is  distinctly  non-normal  [5].  If  the  current 
partition  has  a  larger  test  statistic  than  either  of  the  sub-partitions  created,  it  is  kept. 
Otherwise,  the  two  new  partitions  will  be  kept  and  analyzed  in  the  same  manner  [21]. 
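The quantities $m_i$ and C in Equations (2.26) and (2.27) are straightforward to compute. The sketch below is an illustration only: the function names are hypothetical, the sample size n = 60 is arbitrary, and the split rule encodes one reading of the text, namely that the current partition is kept only when its statistic exceeds the statistics of both sub-partitions.

```python
import math
from statistics import NormalDist

def blom_scores(n):
    """m_i = Phi^{-1}((i - 0.375) / (n + 0.25)) for i = 1, ..., n."""
    inv_cdf = NormalDist().inv_cdf   # inverse CDF of the standard normal
    return [inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

def normalizing_constant(n):
    """C = sqrt(-2.722 + 4.083 n)."""
    return math.sqrt(-2.722 + 4.083 * n)

def keep_current_partition(w_current, w_first, w_second):
    """Assumed split rule: keep the current partition when its test
    statistic is larger than that of both candidate sub-partitions."""
    return w_current > max(w_first, w_second)

n = 60
m = blom_scores(n)
C = normalizing_constant(n)
print(round(m[0], 4), round(m[-1], 4), round(C, 4))
```

The scores are symmetric about zero, so the smallest and largest order statistics receive values of equal magnitude and opposite sign.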

The partitioning of the data is accomplished by employing K-Means with Mahalanobis distance used instead of Euclidean distance. Using Mahalanobis distance preserves the correlations present in the data [21]. If the data are standardized and the features are independent, the two distances will produce the same results, but this is not always the case. The original partitioning of the data is created using Euclidean distance, since there is not yet a covariance structure for the two centers. Once the data are clustered, the sample means and covariances are used in the next iteration. The K-Means algorithm is then employed iteratively as described above. Because the algorithm requires a covariance matrix for each cluster, the algorithm will stop if any partition has fewer than p+1 data points (p being the number of features). Even if the covariance matrix does exist, its inverse will not exist if some of the features are linearly dependent. This is evidenced by eigenvalues of the covariance matrix being zero. This can be rectified by replacing these eigenvalues with a threshold value of 0.5. The modified covariance matrix becomes [21]

$$\tilde{C} = V D V^T \tag{2.29}$$

where D is the matrix with the modified eigenvalues along the diagonal and V is the matrix of eigenvectors of the sample covariance matrix, C.

While Wilson [21] uses the univariate form of the Shapiro-Wilk test statistic, it is not clear how it is applied to multivariate data. Malkovich and Afifi [13] have proposed a multivariate generalization of the test statistic

$$W^* = \frac{\left(\sum_{j=1}^{n} a_j U_{(j)}\right)^2}{(\mathbf{y}_m - \bar{\mathbf{y}})^T A^{-1} (\mathbf{y}_m - \bar{\mathbf{y}})} \tag{2.30}$$

where

$$A = \sum_{j=1}^{n} (\mathbf{y}_j - \bar{\mathbf{y}})(\mathbf{y}_j - \bar{\mathbf{y}})^T \tag{2.31}$$

and $\mathbf{y}_m$ is the observation that has the maximum value over all the observations of

$$(\mathbf{y}_j - \bar{\mathbf{y}})^T A^{-1} (\mathbf{y}_j - \bar{\mathbf{y}}) \tag{2.32}$$

The $a_j$ are defined identically to those for the univariate test, and $U_{(j)}$ are the order statistics. The order statistics are defined by ordering the following statistics

$$U_j = (\mathbf{y}_m - \bar{\mathbf{y}})^T A^{-1} (\mathbf{y}_j - \bar{\mathbf{y}}) \tag{2.33}$$

W* has the same interpretation as W, namely the closer to 1, the more normal the underlying population. Using W* instead of W in Wilson’s algorithm provides a more meaningful multivariate interpretation while being computationally simpler. Figure 2-6 describes the algorithm with n denoting the number of exemplars and m the number of features.

Figure 2-6. RICA Clustering


2.4.2  General  Regression  Neural  Network 

General Regression Neural Networks (GRNN) are a class of RBNN used predominantly for non-linear regression [19]. The hidden layer is identical in structure and setup to the standard RBNN, with a Gaussian kernel centered on each exemplar in the training set. There is an additional layer, as well as an additional output from the hidden layer. The eventual output of this network is the weighted output (z) scaled by the unweighted output of the hidden layer (s). Figure 2-7 illustrates this architecture.

Figure 2-7. GRNN with Single Output

The primary difference between GRNN and RBNN is the training of the hidden weights. There is no training for GRNN’s [19]. Each input-output pair in the training set is folded into the network. The input vector, $x_i$, is the center of radial basis function $h_i$, and the output, $y_i$, is the hidden weight for that node. If the spread, $\sigma_i$, is very small, the network will have no error against the training set; however, the network will not be applicable to new exemplars. The choices for $\sigma_i$ can be made in the same manner as for the RBNN, using the nearest neighbor method.
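A GRNN with a single output can be sketched directly from this description, since no training is involved. The example below is illustrative only: the function name and data are hypothetical, a single common spread is used for every node rather than a separate spread per node, and the Gaussian kernel is assumed to have the form exp(-d^2 / (2 sigma^2)).

```python
import math

def grnn_predict(x, train_inputs, train_outputs, spread):
    """Weighted hidden-layer output divided by the unweighted output."""
    # One Gaussian kernel per training exemplar, centered on that exemplar.
    h = [math.exp(-math.dist(x, c) ** 2 / (2.0 * spread ** 2)) for c in train_inputs]
    z = sum(y * h_i for y, h_i in zip(train_outputs, h))  # weighted output
    s = sum(h)                                            # unweighted output
    return z / s

# Hypothetical one-feature training set.
X = [(0.0,), (1.0,), (2.0,)]
y = [0.0, 1.0, 4.0]

narrow = grnn_predict((1.0,), X, y, spread=0.05)   # reproduces the training target
wide = grnn_predict((1.0,), X, y, spread=100.0)    # approaches the mean of y
print(narrow, wide)
```

With a very small spread the network reproduces the training outputs exactly at the training inputs, while a very large spread drives every prediction toward the mean of the training outputs; neither extreme is useful for new exemplars. A query point far from every center combined with a tiny spread can underflow the denominator to zero, which this sketch does not guard against.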
2.5 Evaluation Techniques

There are several common techniques used to evaluate the utility of a classifier. The most common is estimating the Actual Error Rate (AER). This estimate of true error is obtained by applying the classifier to an independent validation set, because using the training set tends to underestimate the error [3]. There are two components to error, namely False Positive (FP) and False Negative (FN). Positive corresponds to the target, Class 1, and negative relates to the clutter, Class 2. A Confusion Matrix (CM) displays this information graphically, as depicted in Figure 2-8.


Figure 2-8. Confusion Matrix [3]

AER can be computed directly from a CM

$$AER = \frac{FP + FN}{TP + FP + FN + TN} \tag{2.34}$$


Using the estimate of AER to compare two classifiers can produce misleading results, particularly if the prior probabilities are very different [1]. Figure 2-9 illustrates two different classifiers applied to a notional data set. Classifier 1 has the smaller AER (5% vs. 6%, that is, 95% vs. 94% classification accuracy), and would be considered the best classifier based on this measure. However, everything is classified as Class 2, and nothing is detected. No classifier is required to produce this output; an individual can simply assign Class 2 membership to every exemplar. Classifier 2 has a much better probability of detection (80% vs. 0%), defined to be

$$P_D = \frac{TP}{TP + FN} \tag{2.35}$$

where TP and FN are defined as in Figure 2-8. Classifier 2 also has an only slightly higher probability of false alarm (5% vs. 0%), defined as

$$P_{FA} = \frac{FP}{TN + FP} \tag{2.36}$$

With this information, Classifier 2 appears to be the better classifier.
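The three measures can be computed from the four cells of a confusion matrix as sketched below. The counts are hypothetical: they are chosen only to be consistent with the percentages quoted in the text (100 Class 1 targets and 1900 Class 2 clutter exemplars) and are not taken from Figure 2-9.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Return (AER, P_D, P_FA) for a two-class confusion matrix."""
    total = tp + fn + fp + tn
    aer = (fp + fn) / total     # error rate over all exemplars
    p_d = tp / (tp + fn)        # probability of detection
    p_fa = fp / (tn + fp)       # probability of false alarm
    return aer, p_d, p_fa

# Classifier 1 assigns every exemplar to Class 2.
classifier_1 = confusion_metrics(tp=0, fn=100, fp=0, tn=1900)
# Classifier 2 detects 80% of the targets at a 5% false alarm rate.
classifier_2 = confusion_metrics(tp=80, fn=20, fp=95, tn=1805)
print(classifier_1)
print(classifier_2)
```

With these counts Classifier 1 has the lower error rate even though it detects nothing, which is the situation the text warns about when the prior probabilities are very different.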
Figure 2-9. CM Comparison for Notional Data

2.5.1 Receiver Operating Characteristic Curves

The CM (as well as the AER) addresses the performance of a classifier only at the optimal decision threshold. Receiver Operating Characteristic (ROC) curves plot $P_D$ against $P_{FA}$ for different decision thresholds [1]. Figure 2-10 illustrates the general construction of the curve. The decision threshold is set at a given number of intervals across the range of the classifier's output. As the threshold changes from left to right (in this figure), both $P_{FA}$ and $P_D$ decrease as fewer exemplars are classified as Class 1.
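The construction of the ROC points can be sketched as follows. The scores, labels and thresholds are hypothetical, and the sketch assumes that an exemplar is assigned to Class 1 when the classifier output is at or above the threshold.

```python
def roc_points(scores, labels, thresholds):
    """Return one (P_FA, P_D) pair per decision threshold.

    A label of 1 marks a Class 1 (target) exemplar; any other label is clutter.
    """
    n_pos = sum(1 for label in labels if label == 1)
    n_neg = len(labels) - n_pos
    points = []
    for t in thresholds:
        tp = sum(1 for s, label in zip(scores, labels) if label == 1 and s >= t)
        fp = sum(1 for s, label in zip(scores, labels) if label != 1 and s >= t)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical classifier outputs for three targets and three clutter exemplars.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]   # evenly spaced across the output range
points = roc_points(scores, labels, thresholds)
print(points)
```

Under this convention the lowest threshold classifies everything as Class 1, giving the point (1, 1), and the highest threshold classifies nothing as Class 1, giving the point (0, 0).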
Figure 2-10. ROC Curve and Decision Thresholds

There are several metrics that can be used to evaluate ROC curves [1]. The first is visual inspection. If two (or more) ROC curves are overlaid and one curve is always higher (a larger $P_D$ for all $P_{FA}$), that classifier performs better. This works for distinguishing classifiers provided the curves do not overlap. When they do overlap, which is the more common circumstance, objective metrics are necessary.

Alsing [1] presents a metric that can be used to objectively compare overlapping ROC curves, namely the mean distance metric. ROC curves are compared to the chance line, which passes from the origin to (1,1). This line represents the ROC curve for random classification. On this line, $P_D = P_{FA}$ for all decision thresholds, and both correspond to the value $\theta$ used to generate the point on the ROC curve. The metric is the average distance of the ROC curve from this line over all points used to generate the ROC curve. In practice, this metric is

))-(«, -s,!, 

MD  =  -!^ -  (2.37) 

n 

where $P_D(\theta_i)$ and $P_{FA}(\theta_i)$ are the ordered pair of the ROC curve based on the decision threshold $\theta_i$. The classifier with the largest mean distance metric is considered to perform best for the specific problem.


2.5.2  Multinomial  Selection  Procedures 

Alsing  [1]  developed  another  comparison  procedure,  a  Multinomial  Selection 
Technique.  This  technique  compares  posterior  probabilities  for  each  point  in  the 
validation  set.  The  posterior  probabilities  for  quadratic  discriminant  analysis  applied  to 
a  two  class  problem  are  [3] 


PP 


•q, 


^  ^Q2 


(2.38) 


For an FFNN trained to target values of zero and one, the class one posterior probability for a given exemplar is simply the network output. The class two posterior probability for the same exemplar is one minus the output [3]. The posterior probabilities for an RBNN are more problematic. Unlike an FFNN using a sigmoid in the output layer, the outputs of an RBNN are not restricted to the interval (0,1). The outputs are therefore normalized to the interval [0,1], and these normalized outputs become the posterior probabilities.

Once the posterior probabilities have been calculated, the multinomial statistic can be calculated. For each exemplar in the validation set, a “win” is given to the classifier with the highest posterior probability for the class to which the exemplar belongs. When the entire validation set has been processed, the multinomial statistic for each classifier is the number of “wins” divided by the total number of validation points. These statistics are estimates of the true multinomial probabilities, and confidence intervals can be created around each value. If the confidence intervals for two different classifiers do not overlap, the classifier with the larger multinomial statistic can be judged the better classifier for the problem. According to Alsing [1], this procedure can be used if the other metrics described above fail to determine the best classifier.
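The counting of “wins” can be sketched as follows. The posterior values are hypothetical, and the handling of ties (the first classifier listed receives the win) is an assumption, since the text does not address ties. Confidence intervals are omitted from this sketch.

```python
def multinomial_statistics(true_class_posteriors):
    """Proportion of validation exemplars "won" by each classifier.

    true_class_posteriors maps a classifier name to a list holding, for each
    validation exemplar, the posterior probability that the classifier
    assigned to the exemplar's true class.
    """
    names = list(true_class_posteriors)
    n = len(true_class_posteriors[names[0]])
    wins = {name: 0 for name in names}
    for i in range(n):
        # Assumption: on a tie, max() awards the win to the first name listed.
        best = max(names, key=lambda name: true_class_posteriors[name][i])
        wins[best] += 1
    return {name: wins[name] / n for name in names}

# Hypothetical posteriors for four validation exemplars.
posteriors = {
    "DA":   [0.60, 0.40, 0.70, 0.20],
    "FFNN": [0.90, 0.50, 0.60, 0.80],
    "RBNN": [0.70, 0.80, 0.65, 0.30],
}
stats = multinomial_statistics(posteriors)
print(stats)
```

Because every exemplar produces exactly one win, the statistics sum to one across the classifiers.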

In  this  chapter,  three  different  classifiers  were  explored:  DA,  FFNN,  and  RBNN. 
Feature  selection  techniques  were  described  for  DA  and  FFNN.  Additionally,  means  to 
evaluate  the  performance  of  these  classifiers  were  discussed:  AER,  ROC  metrics  and  the 
multinomial  selection  procedure.  In  the  next  chapter,  a  feature  selection  technique  will 
be developed for RBNN in addition to a new clustering routine.

3 Radial Basis Neural Network Techniques


3.1  Overview 

This chapter introduces two new techniques, derivative based saliency (DBS) and signal-to-noise ratio (SNR^RBNN) clustering. The first section of this chapter details DBS as a feature selection technique for Radial Basis Neural Networks (RBNN’s). DBS will be compared in Experiment 3-1 with the feature selection techniques used with Discriminant Analysis (DA) and Feed Forward Neural Networks (FFNN), namely discriminant loadings and Signal-to-Noise Ratio (SNR) respectively. The second section describes the SNR^RBNN clustering algorithm. SNR^RBNN will be compared with K-Means and the Radial Basis Function Iterative Construction Algorithm (RICA) in Experiment 3-2. The final section develops the iterative architecture and feature selection algorithm. This algorithm will be compared to discriminant loadings and SNR in Experiment 3-3, a repeat of Experiment 3-1 with the integrated algorithm replacing K-Means.


3.2  Derivative  Based  Saliency 

A derivative based saliency measure appears to be the only feature selection technique available for RBNN’s. Weight-based saliency measures are inappropriate because the weights are not applied directly to the features as in FFNN. As with FFNN, it is necessary that the data be standardized so that a unit change in each feature is equivalent. Otherwise, it is likely the feature with the highest variance will have the highest measure. The network output for a given exemplar i is

$$z_i = \sum_{j=1}^{p} w_j \exp\left[-\frac{1}{2\sigma_j^2}\sum_{k=1}^{m}\left(x_{ik} - \mu_{jk}\right)^2\right] \tag{3.1}$$

where p is the number of centers, m is the number of features and $\mu_{jk}$ is the kth component of the jth center. The partial derivative of the network output of exemplar i with respect to feature k is

$$DS_{ik} = \frac{\partial z_i}{\partial x_{ik}} = -\sum_{j=1}^{p} \frac{w_j h_{ij}\left(x_{ik} - \mu_{jk}\right)}{\sigma_j^2} \tag{3.2}$$

where

$$h_{ij} = \exp\left[-\frac{1}{2\sigma_j^2}\sum_{k=1}^{m}\left(x_{ik} - \mu_{jk}\right)^2\right] \tag{3.3}$$

When taking the mean saliency across all the exemplars, the average of $DS_{ik}$ can be misleading. Different exemplars, particularly in different classes, can have derivatives of opposite sign, moving the average toward zero. The quantity of interest is the magnitude of the measure across the exemplars. Therefore, the mean absolute saliency measure for the kth feature is

$$\overline{DS}_k = \frac{1}{n}\sum_{i=1}^{n}\left|DS_{ik}\right| \tag{3.4}$$

where  n  is  the  number  of  exemplars  in  the  training  set.  Figure  3-1  illustrates  the 
algorithm  in  flow-chart  format.  The  complete  derivation  is  provided  in  the  Appendix. 
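The saliency calculation can be illustrated with a short sketch. The code assumes a Gaussian basis function of the form exp(-d^2 / (2 sigma^2)) with one spread per center, which is an assumption about the network and not a statement of the exact form derived in the Appendix. All names, centers, weights and exemplars are hypothetical.

```python
import math

def basis(x, center, spread):
    """Assumed Gaussian basis function with spread sigma."""
    return math.exp(-math.dist(x, center) ** 2 / (2.0 * spread ** 2))

def network_output(x, centers, spreads, weights):
    return sum(w * basis(x, c, s) for w, c, s in zip(weights, centers, spreads))

def derivative_saliency(x, k, centers, spreads, weights):
    """Partial derivative of the network output with respect to feature k."""
    return sum(
        -w * basis(x, c, s) * (x[k] - c[k]) / s ** 2
        for w, c, s in zip(weights, centers, spreads)
    )

def mean_absolute_saliency(exemplars, k, centers, spreads, weights):
    """Average magnitude of the derivative over all exemplars."""
    total = sum(abs(derivative_saliency(x, k, centers, spreads, weights)) for x in exemplars)
    return total / len(exemplars)

# Hypothetical two-center network on two standardized features.
centers = [(1.0, 0.0), (-1.0, 0.0)]
spreads = [1.0, 1.0]
weights = [1.0, -1.0]
exemplars = [(0.3, -0.2), (-0.8, 0.5), (1.2, 0.1)]
saliency = [mean_absolute_saliency(exemplars, k, centers, spreads, weights) for k in range(2)]
print(saliency)
```

In an iterative selection scheme, the feature with the smallest mean absolute saliency would be the candidate for removal.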

Examination of Equations (3.2) and (3.3) seems to indicate that prior clustering of centers will improve the performance of the measure. If no clustering is performed, the n exemplars act as centers. Equation (3.2) will evaluate to zero (or approach it) for most of the exemplar-center pairs. For i = j, $x_{ik} - \mu_{jk} = 0$, and for those exemplars far from centers, $h_{ij}$ will approach zero. If exemplars are represented by a center close to them, such as the cluster mean, neither part of the equation will approach zero.

3.2.1 Experiment 3-1: Simple Feature Selection Test

This supposed difference must be verified, and this technique for feature selection needs to be compared against discriminant loadings and SNR. A simple problem will be used to provide preliminary answers, and also to explore the effect noise has on classification problems. The training and validation sets for this problem are randomly generated according to the following distributions. Feature 1 is normally distributed with a standard deviation of one, and a mean of one for class one and a mean of negative one for class two. This is the only true feature in the problem, but there is considerable overlap between the two populations. The remaining nine features are noise features, with all data distributed uniformly between negative one and one. Each training set consists of eleven exemplars, and each validation set of fifty exemplars from each class. Feature selection is performed against the training set, and the error rate is computed on the validation set. Four classifiers (and feature selection techniques) were evaluated against this problem: DA with discriminant loadings, FFNN with SNR, RBNN with no clustering and DBS, and RBNN with K-Means clustering and DBS. Fifty random samples of both training and validation sets were made, and the average performance is reported.
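The data for this experiment are easy to regenerate from the stated distributions. The sketch below draws one validation set; the function names and the random seed are arbitrary choices made for illustration, and only the validation set size (fifty exemplars per class) is taken from the text.

```python
import random

def make_exemplar(class_label, rng, n_noise=9):
    """One exemplar: a single informative feature plus nine noise features."""
    mean = 1.0 if class_label == 1 else -1.0
    feature_1 = rng.gauss(mean, 1.0)                            # true feature
    noise = [rng.uniform(-1.0, 1.0) for _ in range(n_noise)]    # noise features
    return [feature_1] + noise

def make_set(n_per_class, rng):
    """Return a list of (features, class_label) pairs for classes 1 and 2."""
    return [(make_exemplar(c, rng), c) for c in (1, 2) for _ in range(n_per_class)]

rng = random.Random(0)            # fixed seed so the draw is repeatable
validation = make_set(50, rng)
class_1 = [x[0] for x, c in validation if c == 1]
class_2 = [x[0] for x, c in validation if c == 2]
print(sum(class_1) / len(class_1), sum(class_2) / len(class_2))
```

The two class means of Feature 1 are separated by two standard deviations, which is why the populations overlap considerably even though Feature 1 is the only informative feature.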

Figure 3-2 illustrates the relationship between classification accuracy and the number of noise features. The first conclusion that can be made is that noise adversely impacts classification accuracy for all the competing classifiers, and this difference is statistically significant for an overall α = 0.1. This is most true of DA, which performs considerably worse than the artificial neural networks with all the noise variables included, but which performs best with only one feature remaining. Table 3-1 and Figure 3-3 explain a large part of why this is true. DA with discriminant loadings did not make a single mistake in retaining Feature 1 until the end. Table 3-1 includes confidence interval half-widths with an overall α = 0.1 using the Bonferroni approach. Clustering improves the performance of DBS applied to the RBNN’s, validating the premise of the feature selection technique. However, even with K-Means clustering, DBS falls well short of the performance of FFNN with SNR. This leads to poor classification accuracy when more features are removed, as can be seen in Figure 3-2. The classification accuracies for both FFNN and the RBNN with K-Means clustering are approximately equal with four features remaining. After this point, the FFNN continues to improve, while the RBNN begins to plateau, and then dramatically worsens with one feature remaining. This worsening performance is caused by the RBNN removing the good feature too early and too often.


Figure 3-1. DBS Iterative Feature Selection Algorithm

Table 3-1. Results of Feature Selection Test

Figure 3-3. Average Feature Rankings Experiment 3-1

3.3  SNR  Clustering  Technique 

The next topic of discussion involves using the RBNN itself to perform clustering for RBNN. This SNR approach follows the same basic approach used in SNR for feature selection in FFNN. The first requirement is a noise variable. For feature selection this involves a noise feature. In clustering, this requires a noise center added to the RBNN. Before defining what a noise center is, the signal-to-noise ratio measure will be defined. As with the SNR used for feature selection, where the weights of the features are compared to the weights of the noise feature, the weights of the centers are compared to the weight of the noise center. In the clustering instance, the noise is defined as

Noise  =  (3.5) 

where p is the number of centers in the original problem. The SNR measure for each center under consideration is

The superscript RBNN is used to distinguish this from the SNR used for feature selection in FFNN. Any center with a signal-to-noise ratio less than zero is considered to be noise and unnecessary.

The SNR measure is very straightforward, but what is not obvious is the meaning of noise as it applies to a center. When data are standardized to mean zero and unit variance, most of the data will be massed in the region between negative one and one in each feature. In this thesis, the noise center is defined as a random vector from this region, distributed uniformly between negative one and one for each feature. If a random center made with no knowledge of the problem has a greater impact on the output (i.e., has a larger weight) than other centers, those centers can be considered noise.

The SNR clustering algorithm proceeds as follows. The RBNN is first trained using each exemplar as a center, with a noise center added. When the training is complete, the SNR measures are calculated for each center. Those centers with negative ratios are clustered with the nearest within-class center with a positive SNR. The centers for the final network become the cluster means, and the network is trained using these centers. Figure 3-4 further illustrates the algorithm.
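The clustering step that follows training can be sketched as below. The ratio is written here as 10 log10 of the squared center weight over the squared noise-center weight, a form borrowed from the SNR saliency measure used for FFNN feature selection; this is an assumption and not a restatement of the defining equations of this section. Under any such form the ratio is negative exactly when the center's weight is smaller in magnitude than the noise center's weight. The weights are taken as given (no network is trained here), all names and values are hypothetical, a ratio of exactly zero is treated as retained, and the noise weight is assumed to be nonzero.

```python
import math

def snr_db(weight, noise_weight):
    """Assumed ratio: 10*log10(w^2 / w_noise^2); negative when |w| < |w_noise|."""
    if weight == 0.0:
        return float("-inf")
    return 10.0 * math.log10(weight ** 2 / noise_weight ** 2)

def snr_cluster(centers, classes, weights, noise_weight):
    """Merge negative-SNR centers into the nearest retained within-class center."""
    ratios = [snr_db(w, noise_weight) for w in weights]
    keep = [j for j, r in enumerate(ratios) if r >= 0.0]
    groups = {j: [centers[j]] for j in keep}
    for j, r in enumerate(ratios):
        if r >= 0.0:
            continue
        same_class = [i for i in keep if classes[i] == classes[j]]
        if not same_class:
            continue   # assumption: no retained center in this class, so drop it
        nearest = min(same_class, key=lambda i: math.dist(centers[j], centers[i]))
        groups[nearest].append(centers[j])
    # The final centers are the means of the resulting clusters.
    return [
        (tuple(sum(col) / len(members) for col in zip(*members)), classes[j])
        for j, members in groups.items()
    ]

# Hypothetical trained weights for four exemplar centers and one noise center.
centers = [(0.0, 0.0), (0.2, 0.0), (1.0, 1.0), (3.0, 3.0)]
classes = [1, 1, 1, 2]
weights = [2.0, 0.1, -1.5, 0.8]
new_centers = snr_cluster(centers, classes, weights, noise_weight=0.5)
print(new_centers)
```

In this example only the second center has a weight smaller in magnitude than the noise weight, so it is merged with its nearest retained neighbor in the same class and the network would be retrained with three centers.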


3.3.1  Experiment  3-2:  Block-C  Clustering  Test 

This clustering technique will be compared with K-Means and RICA in the following example. The data sets will be generated from the Block-C distribution shown in Figure 1-1. Each training set will contain 60 randomly generated data points, while each validation set will contain 100. All three clustering algorithms will be applied to the training data. Receiver Operating Characteristic (ROC) curves and estimates of the AER will be generated from the validation set. Thirty replications of this procedure will be performed (with a different random noise center generated for each replication), and the averages across the replications reported.


Figure 3-4. Clustering Algorithm

Figure 3-5 displays the average ROC curves for the three clustering algorithms. K-Means clearly dominates the other two clustering techniques for this problem. The same experiment was run with 120 data points in each training set to examine the performances with more data for training. Figure 3-6 demonstrates that SNR^RBNN performs almost identically to K-Means. The AER for K-Means is slightly better than for SNR^RBNN (0.1117 compared to 0.1173), but the difference is not statistically significant. RICA improves but is still dominated by the other two techniques.

While SNR^RBNN performs as well as K-Means with 120 data points in the training set, this problem illustrates the shortcomings of this clustering technique as it was applied here. To perform the clustering, training was accomplished first with all the exemplars as centers, and then an additional network was trained with the reduced centers. This can quickly increase the number of calculations required, particularly as the sample sizes increase. If the network is trained with no clustering, why cluster and train the network again? The next section will discuss how SNR^RBNN can be applied in an iterative manner.

3.4  An  Integrated  Architecture  and  Feature  Selection  Algorithm 

As discussed in Section 3.3, applying SNR^RBNN to a problem where clustering will be done only once entails redundant labor. While it will produce a more parsimonious model, K-Means will accomplish this with less computational effort. If, however, clustering must be done repeatedly to support feature selection, it might prove useful. One of the reasons K-Means performs erratically with DBS is that different centers are generated for each iteration. This section will propose an iterative feature selection algorithm, and test it against the same problem analyzed in Experiment 3-1. Steppe et al. [16] provide the basis for an alternating architecture and feature selection approach for FFNN. The removal of a hidden node was followed by the removal of a feature. This process was repeated until the appropriate number of hidden nodes and features were selected.

Figure 3-5. Block C Clustering Test - 60 Data Points (average ROC curves)

Figure 3-6. Block C Clustering Test - 120 Data Points (average ROC curves)

This algorithm follows the basic approach of DBS. The first iteration begins with SNR^RBNN clustering performed with the whole training set starting as centers. Feature selection is performed, and the least significant feature is removed. The second, and each successive, iteration begins with the centers provided by the previous iteration, clustering the original centers with the nearest within-class retained center. SNR^RBNN is applied to the current set of centers (minus the removed feature). For each iteration, the computational effort is less, as each step entails training with fewer centers. Figure 3-7 describes the algorithm in more detail.

This algorithm can be very flexible, with K-Means being used to cluster for the first iteration if the training set is very large. While it is flexible, it does require supervision. If the classification accuracy drops significantly after an iteration, it could indicate either that a true feature has been deleted or that necessary centers have been removed. At this point, the centers from the previous iteration could be retained, and feature selection can proceed without clustering until it is determined that only significant features remain.

3.4.1  Experiment  3-3:  Simple  Feature  Selection  Test  Revisited 

Figures 3-8 and 3-9 demonstrate the effectiveness of this clustering algorithm applied to the problem described in Section 3.2. SNR^RBNN used iteratively with feature selection performs as well as SNR applied to the FFNN and discriminant loadings used in DA. Table 3-2 illustrates this. The average feature rankings are identical, and SNR^RBNN made only one more mistake in ranking than SNR.

Figure 3-7. Integrated SNR^RBNN/DBS Feature Selection Algorithm


Table 3-2. Results of Feature Selection Test w/ SNR^RBNN

Measures                             DA       FFNN     RBNN w/o clust   RBNN w/ SNR^RBNN
Average Ranking, Feature 1           1        1.06     1.72             1.06
Proportion Feature 1 Ranked First    1        0.96     0.62             0.94
95% CI Half-Width                    0.0582   0.0621   0.1539           0.0753

Figure 3-8. AER for DA, FFNN and RBNN with and without clustering (AER vs. number of features retained; series: DA, FFNN, RBNN w/o clust, RBNN w/ SNR clust)

Figure 3-9. Average Feature Rankings for the Four Classifiers (expected ranking of features; series: DA, FFNN, RBNN w/o clust, RBNN w/ SNR clust)


This chapter has introduced two new techniques: derivative based saliency feature selection and signal-to-noise ratio clustering. Without clustering, the feature selection routine does not perform well, even on the simple problem explored in Section 3.2. While the clustering algorithm performs fairly well, approaching the performance of K-Means as the sample size increases, it does not perform better. Also, for a single iteration, it requires redundant work (training is performed twice). However, when the two techniques are coupled, they provide performance equivalent to discriminant loadings and SNR. These results are only for a simple problem, and more challenging problems will be addressed in the following chapter.

4 Evaluation of Competing Classifiers


4.1  Overview 

This  chapter  will  evaluate  Discriminant  Analysis  (DA),  Feed  Forward  Neural 
Networks  (FFNN)  and  Radial  Basis  Neural  Networks  (RBNN)  applied  to  several 
challenging  problems.  The  first  problem  will  be  Block-C  addressed  in  Sections  1.1  and 
3.3.  The  second  application  will  be  the  University  of  Wisconsin  Breast  Cancer  data.  The 
final  application  will  be  the  classic  Fisher’s  Iris  Problem  with  noise  features  added.  The 
purpose  for  these  final  two  experiments  is  to  evaluate  the  efficacy  of  the  feature  selection 
algorithms  in  addition  to  classifier  performance.  The  analysis  techniques  in  Section  2.5 
will  be  used  to  compare  the  different  classifiers. 

4.2  Experiment  4-1:  Block-C  Classifier  Test 

DA,  FFNN  and  RBNN  will  be  applied  to  the  Block-C  problem.  For  the  first 
experiment,  240  training  points  and  100  validation  points  will  be  used.  Thirty  iterations 
will  be  performed,  with  the  average  Receiver  Operating  Characteristic  (ROC)  curves, 
Apparent  Error  Rate  (AER),  multinomial  test  statistics  and  mean  distance  metrics  being 
generated  for  each  classifier.  Figure  4-1  displays  the  average  ROC  curves,  and  Table  4-1 
shows the average metrics for each classifier. The RBNN will apply SNR^RBNN to perform
clustering on the centers. The FFNN will use eight hidden nodes and will use 40% of
the training data for internal validation.
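
The averaging over thirty iterations can be sketched as follows. This is a minimal illustration rather than the thesis code: the function name `mean_and_half_width` and the choice of a Student-t interval are assumptions, since the text does not state how the 90% half-widths were computed. The constant 1.699 is the 0.95 quantile of the t distribution with 29 degrees of freedom, and the error rates are made-up values.

```python
import math
import statistics

# Assumption: a two-sided 90% Student-t interval over independent replications.
# 1.699 is the 0.95 quantile of the t distribution with 29 degrees of freedom,
# so this default is only appropriate for exactly 30 replications.
T_95_DF29 = 1.699


def mean_and_half_width(values, t_crit=T_95_DF29):
    """Return (sample mean, confidence-interval half-width) for a list of metrics."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    return mean, t_crit * std / math.sqrt(n)


# Hypothetical per-iteration apparent error rates (not thesis data).
aers = [0.08 + 0.01 * ((i % 5) - 2) for i in range(30)]
mean_aer, half_width = mean_and_half_width(aers)
print(f"AER = {mean_aer:.4f} +/- {half_width:.4f}")
```

The same helper could be applied to the per-iteration mean distance and multinomial statistics to fill the remaining rows of a metrics table.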

The RBNN with SNR^RBNN clustering significantly outperforms the other two
classifiers in classification accuracy. Both Artificial Neural Networks (ANN) perform
much better than DA (which performs worse than just guessing). This experiment was
repeated for training set sizes of 480 and 960. DA and FFNN were applied identically,
while RBNN used K-Means with k=100 for each class to cluster the centers for both
experiments.


Figure 4-1. Average ROC Curves for Block-C Problem, 240 Training Points


Table 4-1. Average Metrics for Block-C Problem, 240 Training Points

Measures              DA       FFNN     RBNN
AER                   0.643    0.1613   0.086
90% CI Half-Width     0.0461   0.0622   0.0125
Mean Distance         0.3437   0.6286   0.5385
90% CI Half-Width     0.0383   0.084    0.0156
Multinomial           0.1377   0.615    0.2473
90% CI Half-Width     0.0193   0.1028   0.0948


Figure 4-2. Average ROC Curves for Block-C Problem, 480 Training Points


Table  4-2.  Average  Metrics  for  Block-C  Problem,  480  Training  Points 


Measures 

DA 

FFNN 

RBNN 

AER 

0.689 

0.046 

90%  Cl  Half- Width 

0.0263 

0.0095 

Mean  Distance 

90%  Cl  Half- Width 

Multinomial 

0.112 

90%  Cl  Half- Width 

0.0105 

[illegible]

[illegible]


Figure 4-3. Average ROC Curves for Block-C Problem, 960 Training Points


Table  4-3.  Average  Metrics  for  Block-C  Problem,  960  Training  Points 


Measures 

DA 

FFNN 

RBNN 

AER 

0.706 

0.076 

0.027 

90%  Cl  Half- Width 

0.0303 

0.0179 

0.0083 

Mean  Distance 

0.348 

90%  Cl  Half- Width 

0.0143 

[illegible]

[illegible]

Multinomial 

0.686 

[illegible]

90% CI Half-Width

0.0472

[illegible]

Figures 4-2 and 4-3 display the respective ROC curves, and Tables 4-2 and 4-3
show the metric performance. The RBNN continues to dominate in the ROC curves and AER,
although the FFNN appears to be converging. The other metrics, however,
identify the FFNN as the better classifier. For all three sample sizes, the FFNN has a
higher mean distance metric, although for 240 training points the difference is not
significant with an overall α = 0.1. The FFNN also performs significantly better in the
multinomial selection metric for all sample sizes.

4.2.1  Experiment  4-2:  Perturbed  Block-C  Classifier  Test 

Alsing [1] asserts that a classifier that performs better for the mean distance metric
will be more robust to perturbations in the data. Under this hypothesis, the FFNN will
better handle changes in the data than the RBNN. To test this, the three experiments
conducted in Section 3.2 were repeated with the validation data perturbed. The
validation data were shifted 0.1 in both dimensions. Figures 4-4, 4-5 and 4-6 show the
average ROC curves for the three classifiers applied to the different sample sizes.
Tables 4-4, 4-5 and 4-6 show the average metrics for the three experiments.
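
The perturbation used in this experiment can be sketched as follows. This is only an illustration under stated assumptions: `threshold_classifier` is a hypothetical stand-in for a trained classifier, the four validation points are made up, and only the shift of 0.1 in both dimensions comes from the text.

```python
def shift_points(points, delta=0.1):
    """Return a copy of 2-D points with both coordinates shifted by delta."""
    return [(x + delta, y + delta) for x, y in points]


def apparent_error_rate(classify, points, labels):
    """Fraction of points whose predicted class differs from the true class."""
    errors = sum(1 for p, y in zip(points, labels) if classify(p) != y)
    return errors / len(points)


def threshold_classifier(point):
    """Hypothetical stand-in for a trained classifier (not from the thesis)."""
    return 1 if point[0] > 0.5 else 2


# Made-up validation exemplars and class labels.
validation = [(0.45, 0.2), (0.55, 0.8), (0.30, 0.4), (0.70, 0.1)]
labels = [2, 1, 2, 1]

clean_aer = apparent_error_rate(threshold_classifier, validation, labels)
shifted_aer = apparent_error_rate(threshold_classifier, shift_points(validation), labels)
print(clean_aer, shifted_aer)
```

Note that only the validation points are shifted; the classifier is left exactly as it was trained, which is what makes the comparison a test of robustness.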

For  the  training  size  of  240  exemplars,  the  mean  distance  metric  was  not 
significantly  different  for  RBNN  and  FFNN.  While  not  statistically  significant,  FFNN 
still  performed  better  in  this  metric.  Figure  4-4  and  Table  4-4  show  that  the  FFNN 
reacted  better  to  the  perturbed  data.  The  difference  in  AER  is  no  longer  significant,  and 
the ROC curves now overlap. Although the difference in mean distance is still not
significant, the difference in the multinomial statistic is significant. It is concluded
that the FFNN is the best classifier for this perturbed problem.

This performance is repeated for the sample sizes of 480 and 960. Figures 4-5
and 4-6 show that the ROC curves for the FFNN now dominate the RBNN curves.
Tables 4-5 and 4-6 show that the AER is now lower for the FFNN, although the difference
is still statistically insignificant. Both the mean distance metric and the multinomial
statistic indicate that the FFNN performs better than the RBNN. It is concluded that the
FFNN is more robust to perturbations in the validation data, and is a better classifier
for the perturbed problem.


[Figure: ROC plot, Prob Detect vs. Prob False Alarm]

Figure 4-4. Average ROC Curves for Perturbed Block-C, 240 Training Points


Table 4-4. Average Metrics for Perturbed Block-C, 240 Training Points

Measures              DA       FFNN     RBNN
AER                   0.672    0.338    0.323
90% CI Half-Width     0.0268   0.0257   0.0243
Mean Distance         0.3501   0.3823   0.2284
90% CI Half-Width     0.0383   0.0506   0.0135
Multinomial           0.324    0.4877   0.1883
90% CI Half-Width     0.0226   0.0517   0.0556


Figure 4-5. Average ROC Curves for Perturbed Block-C, 480 Training Points


Table  4-5.  Average  Metrics  for  Perturbed  Block-C,  480  Training  Points 


Measures 

DA 

FFNN 

RBNN

AER 

90%  Cl  Half- Width 

Mean  Distance 

0.428 

90%  Cl  Half- Width 

0.0435 

[illegible]

Multinomial

0.124

90% CI Half-Width

[illegible]

0.0309


Figure 4-6. Average ROC Curves for Perturbed Block-C, 960 Training Points


Table  4-6.  Average  Metrics  for  Perturbed  Block-C,  960  Training  Points 


Measures 

DA 

FFNN 

RBNN 

AER 

0.353 

90%  Cl  Half- Width 

0.0214 

Mean  Distance 

0.4252 

90%  Cl  Half- Width 

[illegible]

0.029

Multinomial

0.376

[illegible]

90% CI Half-Width

0.0236

[illegible]

4.3 University of Wisconsin Breast Cancer Data

The University of Wisconsin Breast Cancer Data (UWBCD) set obtained from the
University of California-Irvine [18] consists of 699 tissue samples. Of these, 241
exemplars were malignant (Class 1) and 458 were benign (Class 2). Each exemplar contained
nine features: clump thickness, uniformity of cell size, uniformity of cell shape, marginal
adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and
mitoses. Alsing [1] produced feature rankings by applying SNR to the data. Bare nuclei
and clump thickness were the most significant, and mitoses and single epithelial cell size
were the least significant.

4.3.1  Experiment  4-3:  UWBCD  Classifier  Comparison 

For this experiment, the three classification techniques were applied to the data
set with all nine features included. The training set consisted of 350 exemplars, with 349
exemplars held out for the validation set. The FFNN used 18 hidden nodes and
partitioned the training set into 210 training and 140 training-test exemplars. The RBNN
used SNR^RBNN to cluster the data, which reduced the number of centers from 350 to 240.

Figure  4-7  displays  the  ROC  Curves  for  the  three  classifiers  and  Table  4-7  shows 
the  metrics  for  this  experiment.  Analysis  of  the  ROC  Curve  and  the  AER  yields  no 
significant  difference  between  the  classifiers.  There  is  no  significant  difference  between 
the  FFNN  and  DA  for  the  multinomial  selection  metric,  but  both  perform  significantly 
better  than  the  RBNN.  The  FFNN  does  perform  significantly  better  than  both  the  RBNN 
and DA for the mean distance metric and should be more robust to perturbations.


Figure 4-7. ROC Curves, UWBCD, 9 Features


Table 4-7. Metrics, UWBCD, 9 Features

Measures              DA       FFNN     RBNN
AER                   0.0372   0.043    0.0372
90% CI Half-Width     0.0243   0.0260   0.0243
Mean Distance         0.6334   0.9129   0.6659
90% CI Half-Width     0.0361   0.0164   0.043
Multinomial           0.5043   0.4585   0.0372
90% CI Half-Width     0.0641   0.0639   0.0243


4.3.2 Experiment 4-4: Perturbed UWBCD Classifier Comparison

This  next  experiment  tests  the  hypothesis  that  the  FFNN  will  be  more  robust  by 
perturbing  the  validation  data  set.  The  perturbation  was  accomplished  by  adding  random 
draws  from  a  normal  population  with  mean  zero  and  standard  deviation  of  two  to  bare 
nuclei  and  clump  thickness  for  each  exemplar  in  the  validation  set.  The  partitioning  of 
the  data  and  the  application  of  the  classifiers  was  identical  to  Experiment  4-3.  Figure  4-8 
illustrates  the  ROC  Curves  and  Table  4-8  displays  the  metric  performance  for  the  three 
classifiers  against  this  perturbed  data.  The  FFNN  clearly  dominates  the  RBNN  and  DA 
in  all  categories.  The  FFNN  was  decidedly  more  robust  to  the  changes  in  the  validation 
set. 
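
The perturbation of the validation set can be sketched as below. This is an illustration only: the function name `perturb_features` is hypothetical, and the column positions 0 (clump thickness) and 5 (bare nuclei) assume the features are stored in the order listed in Section 4.3. The mean of zero and standard deviation of two come from the text.

```python
import random


def perturb_features(exemplars, feature_indices, sigma=2.0, rng=None):
    """Add independent Normal(0, sigma) draws to selected features of each exemplar."""
    rng = rng or random.Random()
    perturbed = []
    for row in exemplars:
        new_row = list(row)  # copy so the original validation data are untouched
        for k in feature_indices:
            new_row[k] += rng.gauss(0.0, sigma)
        perturbed.append(new_row)
    return perturbed


# Made-up exemplar with nine features (not thesis data).
validation = [[5, 1, 1, 1, 2, 1, 3, 1, 1]]
# Assumed column order: index 0 = clump thickness, index 5 = bare nuclei.
noisy = perturb_features(validation, feature_indices=(0, 5), rng=random.Random(1))
print(noisy[0])
```

Passing a seeded `random.Random` instance keeps the perturbation reproducible across the three classifiers, so that each one is scored on the same perturbed set.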

4.3.3  Experiment  4-5:  UWBCD  Feature  Selection  Test 

For  this  last  experiment,  seven  features  were  added  to  the  data  set.  Five  features 
were  noise  variables  uniformly  distributed  between  zero  and  one.  The  remaining  two 
additional  features  were  redundant  features,  being  slight  modifications  of  two  existing 
features,  bare  nuclei,  a  significant  feature,  and  mitoses,  a  relatively  insignificant  feature. 
These redundant features were slightly perturbed to allow DA to work. If the features were
identical, the inverse of the covariance matrix would not exist, and DA could not be
applied. These features were modified by adding random draws from a Normal(0, 0.04)
population to each exemplar’s features.
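
The construction of the sixteen-feature data set can be sketched as follows. The function name `augment_features` and the column positions are hypothetical. The text does not say whether 0.04 in Normal(0, 0.04) is a variance or a standard deviation; this sketch assumes it is a variance, giving a standard deviation of 0.2.

```python
import random


def augment_features(exemplars, redundant_indices, n_noise=5, redundant_sd=0.2, rng=None):
    """Append uniform noise features and noisy copies of selected features.

    Assumption: Normal(0, 0.04) is read as variance 0.04, so redundant_sd = 0.2.
    """
    rng = rng or random.Random()
    augmented = []
    for row in exemplars:
        noise = [rng.random() for _ in range(n_noise)]  # uniform on [0, 1)
        redundant = [row[k] + rng.gauss(0.0, redundant_sd) for k in redundant_indices]
        augmented.append(list(row) + noise + redundant)
    return augmented


# Made-up exemplar with nine features (not thesis data).
original = [[5.0, 1.0, 1.0, 1.0, 2.0, 1.0, 3.0, 1.0, 1.0]]
# Assumed column order: index 5 = bare nuclei, index 8 = mitoses.
expanded = augment_features(original, redundant_indices=(5, 8), rng=random.Random(0))
print(len(expanded[0]))
```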

The three feature selection techniques, Discriminant Loadings, signal-to-noise
ratio (SNR) and derivative-based saliency coupled with SNR^RBNN clustering, were applied
to the data. Classification was performed as each feature was removed. Figure 4-9
shows the AER plotted against the number of features remaining. The minimum AER
was chosen as the ideal termination point for each classifier, and the resultant ROC
curves and metric performance are given in Figure 4-10 and Table 4-9.
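
The removal loop described above can be sketched generically. This is not the thesis implementation: `saliency` and `error_rate` are hypothetical callables standing in for retraining a classifier and computing its saliency measure and AER, and the tie-breaking rule in favor of the smaller feature set is an assumption. The feature names and numbers are toy values.

```python
def backward_elimination(features, saliency, error_rate):
    """Remove the least salient feature one at a time and track the AER.

    saliency(subset)   -> dict mapping each feature in subset to its saliency
    error_rate(subset) -> apparent error rate of a classifier using subset
    Both callables are hypothetical stand-ins for training and scoring a model.
    """
    remaining = list(features)
    history = []  # (feature subset, AER) recorded before each removal
    while remaining:
        history.append((tuple(remaining), error_rate(remaining)))
        if len(remaining) == 1:
            break
        scores = saliency(remaining)
        remaining.remove(min(remaining, key=lambda f: scores[f]))
    # Assumption: ties in AER are broken in favor of the smaller feature set.
    best_subset, best_aer = min(history, key=lambda item: (item[1], len(item[0])))
    return best_subset, best_aer, history


# Toy stand-ins (not thesis data): two informative features and one noise feature.
TOY_SALIENCY = {"bare_nuclei": 3.0, "clump_thickness": 2.0, "noise_1": 0.1}
TOY_AER = {3: 0.15, 2: 0.10, 1: 0.30}

best, aer, trace = backward_elimination(
    list(TOY_SALIENCY),
    saliency=lambda subset: {f: TOY_SALIENCY[f] for f in subset},
    error_rate=lambda subset: TOY_AER[len(subset)],
)
print(best, aer)
```

The `history` list corresponds to the kind of curve shown in an AER versus features-retained plot, with the chosen stopping point at its minimum.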
Figure 4-8. ROC Curves, Perturbed UWBCD, 9 Features


Table 4-8. Metrics, Perturbed UWBCD

Measures              DA       FFNN     RBNN
AER                   0.3438   0.0917   0.2292
90% CI Half-Width     0.0609   0.0370   0.0539
Mean Distance         0.5504   0.7537   0.4384
90% CI Half-Width     0.0319   0.0141   0.0306
Multinomial           0.1318   0.8052   0.063
90% CI Half-Width     0.0433   0.0508   0.0311

DA terminated with four features remaining. All the noise features were
removed, but six of the real features were also removed. The addition of the new features
caused DA to perform significantly worse, even at its optimal point. The FFNN fared
much better, removing all five noise features. Only one original feature was removed,
mitoses, and its removal did not impact classification accuracy. The RBNN retained
three noise features and both redundant features at its terminating point of ten features
retained. At this point, significant features were removed prior to the removal of the
noise features. SNR^RBNN clustering was performed for the first two iterations before
further clustering affected classification accuracy. The number of centers was first
reduced to 224 and finally to 75.

The FFNN and the RBNN were not significantly impacted by the noise and
redundant features. The AER with all 16 features included is not significantly worse than
at their optimal point for both networks. Both networks perform significantly better than
DA at its optimal point. There are no significant differences between the ROC Curves
and AER for the FFNN and the RBNN. However, the FFNN performs significantly
better in the mean distance and multinomial selection metrics. The FFNN performs
feature selection best, and is also the best classifier for Experiment 4-5.

Table 4-9. Metrics for UWBCD Feature Selection Test


Measures 

DA 

FFNN 

RBNN 

Features  Retained 

4 

9 

10 

Noise  Features  Retained 

0 

0 

3 

Redundant  Features  Retained 

1 

1 

2 

AER 

[illegible]

0.0458

90% CI Half-Width

[illegible]

0.0268

Mean Distance

0.5961

[illegible]

0.6042

90% CI Half-Width

0.0342

[illegible]

0.0421

Multinomial

0.0831

0.063

90% CI Half-Width

0.0354

[illegible]

0.0311

[Figure: AER vs. Number of Features Retained for Competing Classifiers; series DA, FFNN, RBNN; x-axis Number of Features Retained]

Figure 4-9. AER vs. Number of Features Retained, Experiment 4-5

Figure 4-10. ROC Curves for Optimal Stopping Point, Experiment 4-5

4.4 Experiment 4-6: Noise-Corrupted Fisher’s Iris Feature Selection Test

Bauer et al. [4] present a noise-corrupted version of Fisher’s classic Iris problem.
These data consist of 148 exemplars belonging to three classes, with 50 exemplars in Class
1 and 49 each in Class 2 and Class 3. Each exemplar has eight features, with the first four
features being the original features of sepal length, sepal width, petal length and petal
width. The final four features are noise features generated as random permutations of the
four real features. Bauer et al. determined that petal width and petal length are the only
features required for optimal classification accuracy. The feature selection techniques
will be evaluated against these criteria.

4.4.1  Classification  for  the  Three  Class  Problem 

Prior to conducting the experiment, the classification techniques discussed
previously must be revisited as they apply to this problem. All of the techniques
discussed in Chapter 2 and Chapter 3 are predicated on classification for a two-class
problem. Before applying these techniques to a problem with three (or more) classes,
some adaptations are required. Only minor changes are required to the actual
classifications for the Artificial Neural Networks (ANN), and no changes are necessary to
generate the quadratic discriminant scores. The ANN’s require three output nodes,
instead of the one necessary for the two-class problem. Instead of training the network to
one for Class 1 and zero for Class 2, the network is trained to the target vectors [1,0,0]
for exemplars in Class 1, [0,1,0] for Class 2 and [0,0,1] for Class 3. An exemplar is
classified in the class corresponding to the node with the largest output.
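
The target coding and the decision rule for the three-class networks can be sketched as follows. The function names `one_hot` and `classify` are hypothetical; classes are numbered from 1 as in the text.

```python
def one_hot(class_label, n_classes=3):
    """Target vector with a one in the position of the class (classes numbered from 1)."""
    return [1 if k == class_label - 1 else 0 for k in range(n_classes)]


def classify(outputs):
    """Return the class (numbered from 1) whose output node has the largest value."""
    return max(range(len(outputs)), key=lambda k: outputs[k]) + 1


targets = [one_hot(c) for c in (1, 2, 3)]
# Hypothetical network outputs for one exemplar.
predicted_class = classify([0.1, 0.7, 0.2])
print(targets, predicted_class)
```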

Most of the differences between the two-class and three-class problems involve
feature selection. For DBS, there are now three measures for each exemplar, one for
each output, differing only in the weight that is applied to the different nodes. The
measure now becomes the average of the absolute value of the individual measures

(4.1)

where DS_ik is the saliency measure described in Equation (3.2) applied to the
exemplar. Discriminant Loadings require more of an adjustment. Equation (2.11) uses
the b defined in Equation (2.6) to generate the loadings. This definition of b is only valid
for two-class problems. Laine [11] recommends estimating b for each class

(4.2)

where Σ̂ is the sample covariance matrix for the whole population and x̄_i is the sample
mean for the i-th class. These b_i are substituted directly for b in Equation (2.11). The
loading for the k-th feature becomes the maximum (in absolute value) of the class loadings.
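
The two combination rules described above can be sketched numerically. This is an illustration under assumptions: the function names are hypothetical, the input values are made up, and the loading rule is taken to return the magnitude of the largest class loading, which is one reading of "maximum (in absolute value)".

```python
def combined_saliency(per_output):
    """Average |saliency| over output nodes; per_output[o][k] is output o, feature k."""
    n_outputs = len(per_output)
    n_features = len(per_output[0])
    return [sum(abs(per_output[o][k]) for o in range(n_outputs)) / n_outputs
            for k in range(n_features)]


def combined_loading(per_class):
    """Largest absolute loading over classes; per_class[c][k] is class c, feature k."""
    n_features = len(per_class[0])
    return [max(abs(row[k]) for row in per_class) for k in range(n_features)]


# Made-up values for three outputs (or classes) and two features.
saliencies = combined_saliency([[1.0, -2.0], [-3.0, 4.0], [2.0, 0.0]])
loadings = combined_loading([[0.2, -0.9], [-0.5, 0.1], [0.3, 0.4]])
print(saliencies, loadings)
```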

Some of the evaluation methods described in Section 2.5 also need to be adjusted,
and some of the methods cannot be applied to the three-class problem. Confusion
Matrices (CM) and AER are generated in the same manner as for the two-class problem,
except that there are nine distinct outcomes rather than four. This difference in
composition of the CM prevents the construction of a true ROC Curve, and consequently
the mean distance metric is unavailable. The multinomial selection procedure is
available, however, with only minor changes. The posterior probabilities for DA are
calculated by applying Equation (2.38), except that the denominator is now the sum of
the three quadratic discriminant scores. The posterior probabilities for the ANN’s are
even simpler than those described in Section 2.5.3. The posterior probabilities for each
class are the outputs (in the case of RBNN’s, these outputs are standardized to the
interval [0,1]) for the corresponding node of the trained networks. The evaluation of this
three-class problem will entail comparison of the feature selection techniques in
parsimony, and the general classification will be evaluated using AER and the
multinomial selection criteria.
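
The three-class confusion matrix and its error rate can be sketched as follows. The function names are hypothetical and the labels are made up; rows index the true class and columns the predicted class, which is an assumed convention.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes=3):
    """Count outcomes; rows are true classes and columns predicted classes (from 1)."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        cm[t - 1][p - 1] += 1
    return cm


def error_rate_from_confusion(cm):
    """Apparent error rate: share of exemplars lying off the main diagonal."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return (total - correct) / total


# Made-up true and predicted classes for six exemplars.
cm = confusion_matrix([1, 1, 2, 2, 3, 3], [1, 2, 2, 2, 3, 1])
aer = error_rate_from_confusion(cm)
print(cm, aer)
```

With three classes the matrix has nine cells, six of which are distinct kinds of error, which is why a single false-alarm versus detection trade-off curve is no longer defined.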
4.4.2 Results for the Noise-Corrupted Iris Problem Feature Selection Test


The Fisher’s Iris data were divided into 75 exemplars for training and 73 for
validation. Fifteen of the training exemplars were allotted for internal validation for the
FFNN. Additionally, the FFNN used twelve hidden nodes. The RBNN began with 75
centers which were reduced to three after the first five iterations. The results of this
experiment are given in Figure 4-11 and Table 4-10. The optimal stopping point for the
RBNN and FFNN was with two features remaining, petal width and petal length, with
petal width being the most salient feature. The optimal feature set for DA included these
two features plus sepal length. All three feature selection techniques produced similar
feature sets and identical estimates of the AER. The optimal FFNN, however,
significantly outperformed DA and the RBNN in the multinomial selection criteria. This
result is consistent with the previous experiments.


Table  4-10.  Metrics  for  Optimal  Classifiers,  Experiment  4-6 


Measures 

DA 

FFNN 

RBNN 

Features  Retained 

3 

2 

2 

Noise  Features  Retained 

0 

0 

0 

AER 

0.0137 

90%  Cl  Half- Width 

0.0326 

Multinomial 

0.0000 

[illegible]

90% CI Half-Width

0.0543

[illegible]

Figure 4-11. AER vs. Number of Features Retained, Experiment 4-6

In  this  chapter  three  primary  problems  were  explored:  Block-C,  the  University  of 
Wisconsin  Breast  Cancer  Data  set  and  Fisher’s  Iris  problem.  For  all  problems  RBNN’s 
perform  at  least  as  well  as  FFNN’s  in  AER  and  in  the  ROC  Curves.  However,  the 
FFNN’s  performed  consistently  better  in  the  mean  distance  and  multinomial  selection 
metrics.  For  this  reason,  the  FFNN’s  performed  significantly  better  than  the  RBNN’s 
when applied to the perturbed data sets. For the two feature selection tests, Experiment
4-5 and Experiment 4-6, the integrated architecture and feature selection algorithm for
the RBNN performed as well as Discriminant Loadings and SNR.

5 Summary and Recommendations


5.1  Overview 

This  chapter  will  summarize  the  existing  techniques  presented,  as  well  as  the 
newly  developed  algorithms,  for  solving  an  integrated  architecture  design  and  feature 
selection  problem  for  radial  basis  neural  networks.  Additionally,  this  chapter  will 
highlight  the  major  contributions  of  the  thesis  and  give  recommendations  for  significant 
areas  of  future  research. 

5.2  Summary  of  Techniques 

This thesis presented several feature selection techniques including Discriminant
Loadings applied to Discriminant Analysis (DA) and signal-to-noise ratio (SNR) applied
to Feed Forward Neural Networks (FFNN). Clustering techniques for Radial Basis
Neural Networks (RBNN) were also discussed. The two techniques applied to the
experiments were K-Means and Radial Basis Function Iterative Construction Algorithm
(RICA). Chapter 3 developed three additional techniques for RBNN’s. The first
technique was feature selection using derivative-based saliency (DBS). The second
technique was a new clustering algorithm, SNR^RBNN, used for architecture selection in
RBNN’s. These techniques were combined to form the integrated architecture and
feature selection algorithm, which alternates between clustering and feature selection until
the appropriate centers and features are retained. Table 5-1 details the techniques and
Table 5-2 illustrates to which experiments they were applied.

Four analysis techniques were also discussed in this thesis: Apparent Error Rate
(AER), visual inspection of the Receiver Operating Characteristic (ROC) Curves, the
mean distance metric, and the multinomial selection procedure. These techniques were
applied to the experiments to evaluate the competing classifiers.

Table 5-1. Description of Classification Techniques

Technique   Classifier   Application              Description
DL          DA           Feature Selection        Discriminant Loadings - Measures the correlation between the output and features
SNR         FFNN         Feature Selection        Signal-to-Noise Ratio - A weight-based saliency measure contrasting features to a noise feature
DBS         RBNN         Feature Selection        Derivative-Based Saliency - Measures the unit change in the output with respect to the feature
SNR^RBNN    RBNN         Architecture Selection   Signal-to-Noise Ratio Clustering - A weight-based clustering algorithm contrasting centers to a noise center
K-Means     RBNN         Architecture Selection   K-Means Clustering Algorithm - A clustering algorithm using Euclidean distance
RICA        RBNN         Architecture Selection   Radial Basis Function Iterative Construction Algorithm - A clustering algorithm using Mahalanobis distance

Table  5-2.  Summary  of  Experiments  and  Techniques. 


Experiment 

Data  Set 

Purpose 

Techniques  I 

DA 

FFNN 

RBNN  1 

DL 

SNR 

A'-Means 

RICA 

SnrRbnn 

DBS 

SnrRbnn 

+  DBS 

Experiment 

3-1 

Simple 

Noise 

Feature 

Selection 

X 

X 

X 

X 

Experiment 

3-2 

Block-C 

Clustering 

■ 

X 

X 

X 

Experiment 

3-3 

Simple 

Noise 

Feature 

Selection 

X 

X 

X 

Experiment 

4-1 

Block-C 

Classifier 

Comparison 

■ 

X 

X 

Experiment 

4-2 

Perturbed 

Block-C 

Classifier 

Robustness 

■ 

X 

X 

Experiment 

4-3 

UW  BCD 

Classifier 

Comparison 

■ 

X 

Experiment 

4-4 

Perturbed 
UW  BCD 

Classifier 

Robustness 

■ 

X 

Feature 

Selection 

X 

X 

X 

Experiment 

4-6 

Noisy 

Iris 

Feature 

Selection 

X 

X 

X

5.3 Summary of Contributions

The major contribution of this thesis is an integrated architecture and feature
selection algorithm for RBNN’s. The performance of this algorithm was comparable to
Discriminant Loadings for DA and SNR for FFNN’s. It also significantly reduced the
number of centers required for optimal classification. Incorporating SNR^RBNN for
architecture selection and DBS for feature selection provides a viable feature selection
routine for RBNN’s, which did not previously exist. Additionally, a new clustering
algorithm was developed that uses the network to determine the necessary architecture.
The new integrated algorithm is suitable for any classification problem. Examples of
potential application areas include the classification of failure modes from sensor data on
various aircraft components, classifying individuals as pass or fail for pilot training, and
discriminating targets from clutter for target recognition systems.

5.4  Conclusions 

There are several general conclusions that can be drawn from this research. This
thesis highlights the need for feature selection, and illustrates why the development of
feature selection for RBNN’s is important. Experiment 3-3 illustrated the effect of noise
on classification accuracy. For all classifiers considered, the AER is significantly worse
for the data with a large number of noise features versus the data with only the true
feature. This effect is more pronounced in the absence of strong features. Experiment 3-3
has significant overlap between the two classes with a minimum error rate of
approximately 16%. Experiment 4-5 has less inherent error, and Experiment 4-6 has
features which will almost perfectly discriminate between the three populations. For
these latter two experiments the noise does not negatively impact classification accuracy
for the Artificial Neural Networks. The AER for DA is significantly worse for the
noise-corrupted data in Experiment 4-5, but not nearly as much as in Experiment 3-3. For
Experiment 4-6, the effect of noise on the AER is eliminated. These experiments
illustrate the need for feature selection in the absence of strong features, particularly for
DA.

This research also highlights the variable performance of the classifiers across the
different experiments. FFNN’s and RBNN’s are consistently the top performers for all
the applications. DA, while performing as well as the ANN’s in Experiments 4-3 and 4-6,
performed significantly worse than the ANN’s in all measures for the other
experiments. The performance of FFNN’s and RBNN’s is similar, with two important
distinctions: 1) RBNN’s outperform FFNN’s in AER for the geometric Block-C problem
of Experiment 4-1, and 2) the ROC Curves for the RBNN’s dominate those of the FFNN
across the training set sizes. For this problem, the RBNN outperforms the FFNN.

While the RBNN’s perform better than the FFNN in AER in Experiment 4-1 and
comparably for the other experiments, FFNN’s consistently perform better in the mean
distance and multinomial selection metrics. The FFNN provides more confidence in the
classification results than DA and RBNN’s for all the applications in this thesis. The
impact of the performance in the mean distance metric is illustrated in Experiments 4-2
and 4-4 where the validation set is perturbed. In both instances, the FFNN’s outperform
the other two classifiers. Of particular interest is Experiment 4-2 in which the FFNN’s
outperform the RBNN’s for the perturbed data set, while the RBNN’s outperform the
FFNN’s for the standard data. These results indicate a fundamental difference in the
problems best suited for the ANN’s. RBNN’s are better suited for applications where the
validation set is distributed identically to the training set and no deviations are expected
for new data. FFNN’s are more resistant to these deviations and are better suited to
applications where the new exemplars might change in time. This is particularly true of
problems involving human data that are to be applied in the long run.

5.5 Recommendations for Future Research


The  results  of  this  research  identify  many  fruitful  areas  of  future  research.  Since 
most  of  the  work  performed  in  this  thesis  was  experimental  in  nature,  it  would  be 
instructive  to  test  the  algorithm  on  problems  other  than  the  four  discussed  herein. 
Through  additional  experimentation,  it  may  be  possible  to  gain  further  insight  into  the 
performance  of  the  integrated  algorithm  as  compared  to  existing  techniques. 

Second,  it  may  be  possible  to  improve  upon  the  procedure  for  selecting  the 
number  and  location  of  the  centers.  In  particular,  this  may  be  accomplished  by  training 
the  centers  as  in  [12].  Implementing  this  approach,  in  conjunction  with  derivative-based 
saliency,  should  be  more  computationally  efficient. 

Finally,  the  empirical  results  provide  some  insight  into  theoretical  relationships 
between  the  signal-to-noise  ratio  clustering  algorithm  and  the  K-means  clustering 
approach.  It  would  be  instructive  to  explore  this  relationship  analytically  to  determine  if, 
in  fact,  the  ROC  curves  for  the  two  approaches  converge  or  if  this  is  simply  an  artifact  of 
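the data sets considered.

Appendix A. Derivation of Derivative-Based Saliency for RBNN’s


The network output, z, of the i-th exemplar is

$$z^{(i)} = \sum_{j=1}^{p} w_j \exp\left(-\frac{1}{2\sigma_j^2}\sum_{k=1}^{m}\left(x_k^{(i)} - \mu_{jk}\right)^2\right) \tag{A.1}$$

where $w_j$ is the weight of the j-th center, p is the number of nodes, $\mu_{jk}$ is the k-th
component of the j-th center and m is the number of features. This can also be written as

$$z^{(i)} = \sum_{j=1}^{p} w_j \prod_{k=1}^{m} \exp\left(-\frac{\left(x_k^{(i)} - \mu_{jk}\right)^2}{2\sigma_j^2}\right) \tag{A.2}$$

The partial derivative of this output with respect to feature l becomes

$$\frac{\partial z^{(i)}}{\partial x_l^{(i)}} = \sum_{j=1}^{p} w_j \frac{\partial}{\partial x_l^{(i)}} \prod_{k=1}^{m} \exp\left(-\frac{\left(x_k^{(i)} - \mu_{jk}\right)^2}{2\sigma_j^2}\right) \tag{A.3}$$

Applying the product rule, this becomes

$$\frac{\partial z^{(i)}}{\partial x_l^{(i)}} = \sum_{j=1}^{p} w_j \sum_{k=1}^{m} \left[\frac{\partial}{\partial x_l^{(i)}} \exp\left(-\frac{\left(x_k^{(i)} - \mu_{jk}\right)^2}{2\sigma_j^2}\right)\right] \prod_{\substack{q=1 \\ q \neq k}}^{m} \exp\left(-\frac{\left(x_q^{(i)} - \mu_{jq}\right)^2}{2\sigma_j^2}\right) \tag{A.4}$$

For $l \neq k$

$$\frac{\partial}{\partial x_l^{(i)}} \exp\left(-\frac{\left(x_k^{(i)} - \mu_{jk}\right)^2}{2\sigma_j^2}\right) = 0 \tag{A.5}$$

For $l = k$

$$\frac{\partial}{\partial x_l^{(i)}} \exp\left(-\frac{\left(x_l^{(i)} - \mu_{jl}\right)^2}{2\sigma_j^2}\right) = -\frac{x_l^{(i)} - \mu_{jl}}{\sigma_j^2} \exp\left(-\frac{\left(x_l^{(i)} - \mu_{jl}\right)^2}{2\sigma_j^2}\right) \tag{A.6}$$

Therefore, Equation (A.3) becomes the result seen in Equation (3.2)

$$\frac{\partial z^{(i)}}{\partial x_l^{(i)}} = -\sum_{j=1}^{p} w_j \frac{x_l^{(i)} - \mu_{jl}}{\sigma_j^2} \prod_{k=1}^{m} \exp\left(-\frac{\left(x_k^{(i)} - \mu_{jk}\right)^2}{2\sigma_j^2}\right) \tag{A.7}$$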
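
The derivative obtained in this appendix can be checked numerically. The sketch below assumes Gaussian hidden nodes with one spread per center; all function names, the two-center network and the exemplar are hypothetical. It compares the analytic derivative with a central finite difference.

```python
import math


def rbnn_output(x, centers, sigmas, weights):
    """Output of a Gaussian RBNN: sum_j w_j * exp(-||x - c_j||^2 / (2 sigma_j^2))."""
    total = 0.0
    for center, sigma, weight in zip(centers, sigmas, weights):
        sq_dist = sum((xk - ck) ** 2 for xk, ck in zip(x, center))
        total += weight * math.exp(-sq_dist / (2.0 * sigma ** 2))
    return total


def rbnn_saliency(x, centers, sigmas, weights, feature):
    """Analytic partial derivative of the output with respect to one feature."""
    total = 0.0
    for center, sigma, weight in zip(centers, sigmas, weights):
        sq_dist = sum((xk - ck) ** 2 for xk, ck in zip(x, center))
        hidden = math.exp(-sq_dist / (2.0 * sigma ** 2))
        total += -weight * (x[feature] - center[feature]) / sigma ** 2 * hidden
    return total


def central_difference(f, x, feature, step=1e-6):
    """Numerical derivative of f at x along one coordinate."""
    up = list(x)
    down = list(x)
    up[feature] += step
    down[feature] -= step
    return (f(up) - f(down)) / (2.0 * step)


# Hypothetical two-center network and exemplar (not thesis data).
CENTERS = [[0.0, 0.0], [1.0, 1.0]]
SIGMAS = [0.5, 0.8]
WEIGHTS = [1.0, -0.5]
x = [0.3, 0.7]

analytic = rbnn_saliency(x, CENTERS, SIGMAS, WEIGHTS, feature=0)
numeric = central_difference(lambda v: rbnn_output(v, CENTERS, SIGMAS, WEIGHTS), x, 0)
print(analytic, numeric)
```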
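Appendix B. Derivation of Derivative-Based Saliency for GRNN’s


The DBS measures for GRNN’s are obtained in a similar fashion to RBNN’s.
The network output for the i-th exemplar is

$$y^{(i)} = \frac{\sum_{j=1}^{p} w_j \exp\left(-\frac{1}{2\sigma_j^2}\sum_{k=1}^{m}\left(x_k^{(i)} - \mu_{jk}\right)^2\right)}{\sum_{j=1}^{p} \exp\left(-\frac{1}{2\sigma_j^2}\sum_{k=1}^{m}\left(x_k^{(i)} - \mu_{jk}\right)^2\right)} \tag{B.1}$$

This is the sum of the weighted hidden outputs, $z^{(i)}$, scaled by the unweighted hidden
outputs, $s^{(i)}$. The partial derivative of this expression with respect to the l-th feature is
obtained by using the quotient rule, and is given by

$$\frac{\partial y^{(i)}}{\partial x_l^{(i)}} = \frac{s^{(i)} \dfrac{\partial z^{(i)}}{\partial x_l^{(i)}} - z^{(i)} \dfrac{\partial s^{(i)}}{\partial x_l^{(i)}}}{\left(s^{(i)}\right)^2} \tag{B.2}$$

The partial derivative of $z^{(i)}$ is given in Equation (A.7), with the partial derivative of
$s^{(i)}$ differing only in the absence of the weights. Therefore, the saliency measure is

$$\frac{\partial y^{(i)}}{\partial x_l^{(i)}} = \frac{\left(\sum_{j=1}^{p} w_j h_{ij}\right)\left(\sum_{j=1}^{p} \frac{x_l^{(i)} - \mu_{jl}}{\sigma_j^2} h_{ij}\right) - \left(\sum_{j=1}^{p} h_{ij}\right)\left(\sum_{j=1}^{p} w_j \frac{x_l^{(i)} - \mu_{jl}}{\sigma_j^2} h_{ij}\right)}{\left(\sum_{j=1}^{p} h_{ij}\right)^2} \tag{B.3}$$

where

$$h_{ij} = \exp\left(-\frac{1}{2\sigma_j^2}\sum_{k=1}^{m}\left(x_k^{(i)} - \mu_{jk}\right)^2\right) \tag{B.4}$$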
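
The quotient-rule result for GRNN’s can be checked in the same numerical way. The sketch assumes Gaussian hidden nodes with one spread per center; the function names, network values and exemplar are hypothetical.

```python
import math


def hidden_outputs(x, centers, sigmas):
    """Unweighted Gaussian hidden-node outputs for one exemplar."""
    return [math.exp(-sum((xk - ck) ** 2 for xk, ck in zip(x, c)) / (2.0 * s ** 2))
            for c, s in zip(centers, sigmas)]


def grnn_output(x, centers, sigmas, weights):
    """Weighted hidden outputs divided by the sum of the unweighted hidden outputs."""
    h = hidden_outputs(x, centers, sigmas)
    return sum(w * hj for w, hj in zip(weights, h)) / sum(h)


def grnn_saliency(x, centers, sigmas, weights, feature):
    """Analytic partial derivative of the GRNN output via the quotient rule."""
    h = hidden_outputs(x, centers, sigmas)
    slopes = [(x[feature] - c[feature]) / s ** 2 for c, s in zip(centers, sigmas)]
    numer = sum(w * hj for w, hj in zip(weights, h))
    denom = sum(h)
    d_numer = -sum(w * a * hj for w, a, hj in zip(weights, slopes, h))
    d_denom = -sum(a * hj for a, hj in zip(slopes, h))
    return (denom * d_numer - numer * d_denom) / denom ** 2


def central_difference(f, x, feature, step=1e-6):
    """Numerical derivative of f at x along one coordinate."""
    up = list(x)
    down = list(x)
    up[feature] += step
    down[feature] -= step
    return (f(up) - f(down)) / (2.0 * step)


# Hypothetical two-center network and exemplar (not thesis data).
CENTERS = [[0.0, 0.0], [1.0, 1.0]]
SIGMAS = [0.5, 0.8]
WEIGHTS = [1.0, -0.5]
x = [0.3, 0.7]

analytic = grnn_saliency(x, CENTERS, SIGMAS, WEIGHTS, feature=0)
numeric = central_difference(lambda v: grnn_output(v, CENTERS, SIGMAS, WEIGHTS), x, 0)
print(analytic, numeric)
```

One sanity check on the formula: when every weight is equal, the GRNN output is constant, so the saliency of every feature should be zero up to rounding.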